Preface


Writing a science book on recently advanced topics was not an easy journey, but it
was worth all the time and challenges, since we know that it will be used to bring to
justice those who have committed crimes and to help sustain peace through the power of
science. This book is the product of more than 2 years of continuous work (since 2021),
starting with writing a proposal, gathering the best authors, and following up with
authors on deadlines. The hardest job was finding suitable meeting times for editors living
on different sides of the world in widely separated time zones; when it was too early for
one, it was too late for another.

Next Generation Sequencing (NGS) Technology in DNA Analysis explains and summarizes
the applications of NGS technology in the field of forensic DNA analysis. This book
covers the shift from capillary electrophoresis (CE) to NGS platforms and the fundamentals
of NGS technologies, applications, and advancements. Sections provide an overview
of NGS technology and forensic science, including information on processing biological
samples for forensic analysis, sequence analysis, and software for analyzing
NGS data. Next, the book explores the useful applications of NGS-based forensic DNA
analysis and covers the validation and interpretation guidelines for NGS workflows.
Although most forensic science laboratories are still in the process of migrating from
CE-based techniques to NGS-based techniques, the limitations and future challenges
of NGS technology in forensic science are also explained in the book.

With chapter contributions from an international array of experts and the inclusion of 
practical case studies, this book is the most useful reference for academics and researchers 
in genetics, biotechnology, bioinformatics, biology, and medicine as well as forensic 
DNA scientists and practitioners who aim to learn, use, apply, and validate NGS-based 
technologies. We hope that this book will shed light for the scientists and academics
who seek knowledge and offer them the answers they pursue. We look forward to
conversations with our readers.

Happy Reading!!! 

Hirak Ranjan Dash 
Kelly M. Elkins 

Noora Rashid Al-Snan 
May 2023 
CHAPTER 1


Transition of capillary electrophoresis 
to next generation sequencing for 
forensic DNA analysis: Need of the 
hour 


Noora Rashid Al-Snan 


Forensic Science Laboratory, Directorate of Forensic Science, General Directorate of Criminal Investigation and Forensic 
Science, Ministry of Interior, Manama, Bahrain 


Introduction 


It has been more than a decade since the first next generation sequencing (NGS) tech- 
nology appeared, dramatically altering the way genetic research is conducted and ush- 
ering in a new era of high-throughput genomic analysis. The terms NGS, massively 
parallel sequencing (MPS), and deep sequencing are all used to describe a DNA 
sequencing method that has revolutionized genomic research. The full human 
genome can be sequenced in a single day using NGS (Behjati & Tarpey, 2013). 
Today, full genomes are mapped and published almost weekly, at an increasing rate 
and at a lower cost (Borsting & Morling, 2015). NGS methods and platforms have 
matured over the last decade, and the quality of the sequences has reached the point 
where NGS is used in human clinical diagnostics. Forensic genetic laboratories have 
also investigated NGS technology, and there has been a small increase in the number 
of scientific articles and conference presentations dealing with forensic aspects of NGS 
in recent years. These contributions have been assessed as offering new possibilities for 
forensic genetic casework. More information can be obtained from processing the 
samples in a single experiment by applying combinations of markers (short tandem re- 
peats (STRs), single nucleotide polymorphisms (SNPs), insertions/deletions, and 
messenger RNA (mRNA)) that cannot be analyzed simultaneously with today’s stan- 
dard polymerase chain reaction-capillary electrophoresis (PCR-CE) methods 
(Borsting & Morling, 2015). 

DNA sequencing has a long history in forensic genetics. In the early 1990s, STRs 
were introduced as polymorphic DNA loci in the forensic field (Puers et al., 1993) 
and have become the primary workhorse for individual identification in criminal cases, 
paternity analyses, and missing person identification (Gill et al., 1997). Two seminal ar- 
ticles describing methods for DNA sequencing were published in 1977. Allan Maxam 
and Walter Gilbert described a method for separating base-specific chemical cleavage 
products from terminally labeled DNA fragments using gel electrophoresis (Maxam &
Gilbert, 1977; Voelkerding et al., 2009). Sanger and colleagues described the
dideoxynucleotide (ddNTP) chain-termination method, in which the addition of a ddNTP to a
growing DNA chain prevents further extension by the DNA polymerase (Sanger
et al., 1977). This method was considered the gold standard for nucleic acid sequencing
for the subsequent two and a half decades (Grada & Weinbrecht, 2013). Initially, the
synthesized DNA fragments were separated by slab gel electrophoresis and detected by
incorporating radioactively or fluorescently labeled deoxynucleotides (dNTPs) into the
DNA fragments. Subsequently, fluorescently labeled ddNTPs and CE platforms
were introduced, which increased sensitivity and throughput and reduced the cost of
Sanger sequencing to the point where complete genome sequencing became feasible
(Mitchelson, 2003). Sanger technology was used in an industrial, high-throughput
configuration to sequence the first human genome, which was completed in 2003 as 
part of the Human Genome Project, a 13-year effort with an estimated cost of $2.7 
billion (Wheeler et al., 2008). 

Specific guidelines were adopted to select suitable STR loci for forensic DNA
analysis. Thus, core loci were defined, with significant overlap between international
legislation (Welch et al., 2012). PCR-based amplicon sizing methods and CE systems 
were used to identify allele categories (Gill et al., 1997) using a simple nomenclature 
convention (Butler, 2006). However, the Sanger sequencing method was used continu- 
ously for verification and identification of, e.g., STR alleles (see references in STRbase, 
http://www.cstl.nist.gov/strbase/). 

There is a continuing need for CE systems. However, in terms of captured information,
multiplex size, and the analysis of highly degraded samples, NGS is adding a new
dimension to the field of forensic genetics and offering distinct advantages over CE
systems (Borsting & Morling, 2015). The biggest advantage is that STR data generated by
NGS are also compatible with any CE-based database (Scheible et al., 2014). Additionally,
NGS-derived STR genotypes supplement CE-derived genotypes by capturing
the full nucleotide sequence underlying the repeat units and nearby flanking regions.
Forensic tests using NGS will be able to distinguish STR variants that cannot be
distinguished by mass spectrometry, such as repeat motifs that are shifted relative to each other
in the repeat region (Pitterl et al., 2008). NGS STR typing is expected to be
extremely useful in routine casework by increasing discrimination power, improving
mixture resolution, and improving the identification of stutter peaks and artifacts
(Gettings et al., 2015). The true variation in core forensic STR loci has been discovered, as
have previously unknown STR alleles (Borsting & Morling, 2015). Although NGS 
has largely replaced conventional Sanger sequencing in genomic research, it has not 
yet made its way into normal practice (Behjati & Tarpey, 2013). Also, NGS STR analysis 
presents challenges for forensic practitioners. The new technology will have an impact on 
how data is analyzed and reported, as well as how it is stored and searched in databases 
(Parson et al., 2016). 
DNA sequencing platforms


NGS platforms all have one thing in common: MPS of clonally amplified or single DNA 
molecules separated by space in a flow cell. This design is different from Sanger 
sequencing, which is based on the electrophoretic separation of chain-termination prod- 
ucts generated in individual sequencing reactions. Sequencing in NGS is accomplished 
through repeated cycles of polymerase-mediated nucleotide extensions or, in one format, 
iterative cycles of oligonucleotide ligation. Depending on the platform, NGS generates 
hundreds of megabases to gigabases of nucleotide-sequence output in a single instrument 
run (Voelkerding et al., 2009). 

In 1996, pyrosequencing was introduced as a real-time sequencing alternative to 
Sanger sequencing (Ronaghi et al., 1996). The nucleotides were added sequentially to 
the DNA synthesis reaction, and the pyrophosphate produced was used to generate light 
via a cascade of enzymatic reactions involving the three enzymes ATP sulfurylase,
luciferase, and apyrase. Because the light was detected in real time by a CCD camera, no
electrophoresis of the sequencing products was required. Pyrosequencing was less expen- 
sive and faster than Sanger sequencing, and the method was applied to mtDNA 
sequencing (Andréasson et al., 2002) and later also used for STR sequencing (Divne 
et al., 2010). However, the short sequencing length and especially the limited multiplex- 
ing capability of the instruments were not compatible with the low amounts of DNA 
usually recovered from trace samples, and the method was never used in casework 
(Divne et al., 2010). The first commercial high-throughput sequencing platform, the 
Genome Sequencer 20 from 454 Life Sciences, used pyrosequencing (Margulies et al., 
2005). Although the first pyrosequencing instruments were never widely
used in science, this technology made it possible to sequence the human genome
in 5 months for $1.5 million (Wheeler et al., 2008). Since then, several
high-throughput sequencing methods and platforms have been introduced. The majority of
them have been acquired by larger corporations, and the instruments’ names have occa- 
sionally changed; for example, Solexa was renamed Illumina. Some have come and gone, 
such as Helicos BioSciences’ HeliScope platform (Braslavsky et al., 2003), and Roche 
announced in 2015 that the highly successful 454 pyrosequencing would be phased 
out. The Supported Oligonucleotide Ligation and Detection (SOLiD) System 2.0 platform,
which is distributed by Applied Biosystems, was developed in the laboratory of George
Church and reported in 2005 along with the resequencing of the Escherichia coli genome
(Shendure et al., 2005). Numerous commercial capture assays are available
from different companies, e.g., the SureSelect Human All Exon Kit (Agilent), the
HaloPlex Exome Kit (Agilent), the Ion AmpliSeq Exome RDY Kit (Life Technologies),
the Nextera Rapid Capture Exome Kit (Illumina), the SeqCap EZ Human Exome
Library (Roche NimbleGen), the TruSight One Sequencing Panel (Illumina) that targets
4800 genes, or the Ion AmpliSeq Cancer Panel (Life Technologies) that targets 400 genes
by PCR. The above-mentioned companies also provide services for the generation
of customized panels defined by the user for specific projects or purposes (Borsting &
Morling, 2015).


Overview of the methodology 


Although each NGS technology has its own approach to sequencing, the NGS platforms 
share a common foundation of template preparation, sequencing and imaging, and data 
analysis (Metzker, 2010). 


Template preparation 


The process of template preparation entails creating a nucleic acid library (DNA or com- 
plementary DNA (cDNA)) and amplifying it. The DNA (or cDNA) sample is frag- 
mented, and adapter sequences (synthetic oligonucleotides of a known sequence) are 
ligated onto the ends of the DNA fragments to create sequencing libraries. Libraries 
are clonally amplified in preparation for sequencing after they have been produced (Ber- 
glund et al., 2011; Quail et al., 2012). 


Sequencing and imaging 


To retrieve nucleic acid sequences from amplified libraries, most platforms rely on
sequencing by synthesis. The library fragments serve as a template for the creation of a
new DNA fragment. The fragments are sequenced by repeatedly flooding them with known
nucleotides in a sequential order and washing between additions. Nucleotides are digitally
recorded as sequence as they are incorporated into the growing DNA strand. Each platform
relies on a slightly different mechanism for detecting
nucleotide sequence information. For example, the Life Technologies Ion Torrent
Personal Genome Machine (PGM) uses semiconductor sequencing, which detects pH
changes caused by the release of a hydrogen ion as a nucleotide is incorporated into a
growing strand of DNA (Quail et al., 2012). The MiSeq, on the other hand, depends
on the detection of fluorescence from fluorescently labeled nucleotides
incorporated into the growing DNA strand (Quail et al., 2012).
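
The flow-based principle described above, in which known nucleotides are supplied one
at a time and each incorporation event is recorded, can be illustrated with a short
Python sketch. It is a deliberately idealized model and not the signal processing of any
real instrument: the flow order "TACG", the function names, and the assumption that the
signal equals the exact number of bases incorporated in a flow are all illustrative
choices made here.

```python
FLOW_ORDER = "TACG"  # assumed cyclic nucleotide flow order (illustrative only)


def simulate_flows(synthesized_strand, flow_order=FLOW_ORDER, max_flows=400):
    """Return (nucleotide, signal) pairs for an idealized flow-based run.

    In each flow a single nucleotide type is offered. The signal is the number
    of consecutive bases incorporated in that flow (0 if the next base of the
    growing strand is a different nucleotide).
    """
    flows = []
    position = 0
    flow_index = 0
    while position < len(synthesized_strand) and flow_index < max_flows:
        nucleotide = flow_order[flow_index % len(flow_order)]
        incorporated = 0
        # A homopolymer stretch is incorporated within one flow.
        while (position < len(synthesized_strand)
               and synthesized_strand[position] == nucleotide):
            incorporated += 1
            position += 1
        flows.append((nucleotide, incorporated))
        flow_index += 1
    return flows


def call_bases(flows):
    """Convert the recorded flow signals back into a base sequence."""
    return "".join(nucleotide * signal for nucleotide, signal in flows)


if __name__ == "__main__":
    recorded = simulate_flows("TTAGGC")
    print(recorded)
    print(call_bases(recorded))
```

In this model a homopolymer such as "TT" produces one flow with a doubled signal rather
than two separate events, so a real platform has to estimate run length from signal
intensity.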


Data analysis 


After sequencing, raw sequence data must go through several analysis steps.
Preprocessing the data to remove adapter sequences and low-quality reads,
mapping the data to a reference genome or de novo assembly of the sequence reads,
and analyzing the resulting sequence are all part of a generic data analysis workflow
for NGS data. The sequence analysis can include a wide range of bioinformatics
assessments, such as variant calling to detect SNPs or indels (i.e., the
insertion or deletion of bases), the discovery of novel genes or regulatory elements, and the
assessment of transcript expression levels. The identification of both somatic and germline
mutation events that may contribute to the diagnosis of a disease or genetic condition can
also be included in the analysis. Numerous free online tools and software
packages are available to perform the bioinformatics required to fully analyze sequence data
(Gogol-Döring & Chen, 2012).
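
The preprocessing step mentioned above (removal of adapter sequences and low-quality
reads) can be sketched in a few lines of Python. This is a minimal illustration and not
a substitute for dedicated trimming software: the adapter string, the thresholds, and
the function names are assumptions made for the example, and only exact adapter matches
and Phred+33 quality encoding are handled.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(character) - offset for character in quality_string]


def trim_adapter(sequence, quality, adapter):
    """Cut the read at the first exact occurrence of the adapter, if present."""
    index = sequence.find(adapter)
    if index == -1:
        return sequence, quality
    return sequence[:index], quality[:index]


def preprocess(reads, adapter, min_mean_quality=20, min_length=10):
    """Trim adapters, then drop reads that are too short or of low mean quality.

    `reads` is an iterable of (sequence, quality_string) pairs.
    """
    kept = []
    for sequence, quality in reads:
        sequence, quality = trim_adapter(sequence, quality, adapter)
        if len(sequence) < min_length:
            continue
        scores = phred_scores(quality)
        if sum(scores) / len(scores) < min_mean_quality:
            continue
        kept.append((sequence, quality))
    return kept


if __name__ == "__main__":
    example_adapter = "AGATCGGAAG"  # placeholder adapter sequence for illustration
    example_reads = [
        ("ACGTACGTACGT" + example_adapter, "I" * 22),  # adapter read-through
        ("ACGTACGTACGT", "#" * 12),                    # low base quality
        ("ACG" + example_adapter, "I" * 13),           # too short after trimming
    ]
    print(preprocess(example_reads, example_adapter))
```

Only the first example read survives: its adapter is removed and its mean quality is
high, whereas the other two are discarded for low quality and insufficient length.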

FDSTools, an open-source software program created especially for this purpose, can
be used to examine, interpret, and summarize forensic MPS-STR data. It offers a way
not only to process all raw MPS data but also to give a DNA specialist a
straightforward, portable tool for summarizing, illustrating,
and flexibly explaining nearly all individual DNA sequences in court. When MPS is
employed, STRs are assessed as sequence variants with distinct, precisely determinable
stutter characteristics (Hoogenboom et al., 2017).

For each individual allele, FDSTools analyzes a database of reference samples to
identify stutter and other systematic PCR or sequencing errors. Moreover, stutter models
are developed for every repeating element to predict stutter artifacts for alleles that are
not represented in the reference set. Following that, the noise in a sequence profile is
identified and corrected. The end result is a more accurate
depiction of a sample’s actual composition (Hoogenboom et al., 2017). Understanding
the experimental error profile of the entire analytical technique is essential for a thorough
interpretation of each experiment’s findings. In contrast to CE, where one “only” needs
to monitor problems like peak bleed-through, peak shifts, allele imbalance, unusually
high stutter, and new alleles at unexpected locations in the STR profile, MPS requires
one to investigate the sequence variation among many millions of individual sequence
reads. In CE, peaks are typically accepted as genuine only if they
exceed a predetermined fluorescence intensity detection threshold (say, 50 RFU).
Similarly, with MPS, at least when using FDSTools, one must set an analytical threshold
(AT), in this case the number of reads with the same sequence structure. The
experimental design has a significant impact. One would anticipate fewer reads per sample
and per locus (in the case of a multiplex STR design) in a run with many different DNA
samples pooled for database purposes than in one with fewer samples pooled.
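
The relationship between pooling and read depth, and the effect of a read-count
analytical threshold, can be made concrete with a small Python sketch. It does not use
or reproduce FDSTools; the function names, the run yield, and the threshold value are
hypothetical, and reads are assumed to be distributed evenly over samples and loci,
which is rarely true in practice.

```python
from collections import Counter


def expected_reads_per_locus(total_reads, n_samples, n_loci):
    """Average reads per sample and per locus, assuming an even distribution."""
    return total_reads / (n_samples * n_loci)


def apply_analytical_threshold(reads, min_reads):
    """Count identical sequences and keep those seen at least `min_reads` times."""
    counts = Counter(reads)
    return {sequence: n for sequence, n in counts.items() if n >= min_reads}


if __name__ == "__main__":
    # Same hypothetical run yield, pooled over many versus few samples.
    print(expected_reads_per_locus(12_000_000, n_samples=96, n_loci=50))
    print(expected_reads_per_locus(12_000_000, n_samples=24, n_loci=50))

    # Hypothetical reads at one locus: an allele, a stutter-like product,
    # and a rare sequence carrying a single base error.
    locus_reads = (["AGAT" * 10] * 120
                   + ["AGAT" * 9] * 15
                   + ["AGAT" * 9 + "AGAC"] * 3)
    print(apply_analytical_threshold(locus_reads, min_reads=10))
```

With four times as many samples in the pool, the expected depth per locus falls to a
quarter, so a fixed AT expressed in reads becomes correspondingly harder to exceed.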


The impact of NGS on genomics research 


NGS has significantly accelerated multiple areas of genomics research in the
years since the first commercial platform became available, enabling experiments that
were previously not technically feasible or affordable (Voelkerding et al., 2009). The
applications of NGS appear to be nearly limitless, allowing for rapid advances in many fields
related to the biological sciences. The human genome is being resequenced in order to
identify genes and regulatory elements involved in pathological processes (Grada & 
Weinbrecht, 2013). Through whole-genome sequencing of a wide range of organisms, 
NGS has also provided a wealth of information for comparative biology studies. NGS is 
used in the fields of public health and epidemiology to identify novel virulence factors by 
sequencing bacterial and viral species. As NGS becomes more popular, it is inevitable that 
new and innovative applications will emerge. 


Genomic analysis 


NGS’s high-throughput capability has been used to sequence entire genomes ranging 
from microbes to humans (Margulies et al., 2005; Pearson et al., 2007). NGS can be 
used to sequence entire genomes or be constrained to specific areas of interest, including 
all 22,000 coding genes (a whole exome) or small numbers of individual genes (Behjati & 
Tarpey, 2013). The capacity to sequence full human genomes at a low cost using NGS 
has sparked an international effort to sequence thousands of human genomes over the 
next decade (http://www.1000genomes.org), allowing for unprecedented characteriza- 
tion and cataloging of human genetic variation (Voelkerding et al., 2009). 


Targeted genomic resequencing 


Sequencing of genomic subregions and gene sets is being used to identify polymorphisms 
and mutations in cancer-related genes and regions of the human genome implicated in 
disease by linkage and whole-genome association studies (Voelkerding et al., 2009; 
Yeager et al., 2008). Particularly in the latter case, regions of interest can range from hun- 
dreds of kb (Almalki et al., 2017; Voelkerding et al., 2009; Yeager et al., 2008) to several 
Mb in size. Several genomic-enrichment steps, both traditional and novel, are being 
incorporated into overall experimental designs to maximize the use of NGS for 
sequencing such candidate regions (Voelkerding et al., 2009; Yeager et al., 2008). 
Although whole-genome and whole-exome sequencing are possible, targeted
sequencing of specific genes or genomic regions is preferred in many cases where a
suspected disease or condition has been identified. Targeted sequencing
provides significantly deeper coverage of genomic regions of interest while reducing
sequencing cost and time (Xuan et al., 2013). Researchers have started to create
sequencing panels that target hundreds of genomic regions known to be hotspots for 
disease-causing mutations. These panels sequence only desired regions of the genome, 
excluding the majority of the genome from analysis (Grada & Weinbrecht, 2013). Re- 
searchers and clinicians can create targeted sequencing panels that include specific 
genomic regions of interest. Furthermore, sequencing panels that target common regions 
of interest, such as hotspots for cancer-causing mutations, can be purchased for clinical 
use (Rehm, 2013). Targeted sequencing, whether of single genes or entire panels of
genomic regions, aids in the rapid diagnosis of many genetic diseases. The findings of 
disease-targeted sequencing can help with therapeutic decisions in a variety of diseases, 
including many cancers for which treatments can be cancer-type-specific (Rehm, 2013). 
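
The coverage advantage of targeted sequencing noted above follows from simple
arithmetic: for a fixed sequencing yield, mean depth is inversely proportional to the
size of the region sequenced. The Python sketch below illustrates this. The yield,
target sizes, and on-target fraction are assumed round numbers chosen for illustration,
not specifications of any kit or platform.

```python
def mean_depth(yield_bases, target_size_bp, on_target_fraction=1.0):
    """Mean depth of coverage = usable sequenced bases / size of the target."""
    if target_size_bp <= 0:
        raise ValueError("target_size_bp must be positive")
    return yield_bases * on_target_fraction / target_size_bp


if __name__ == "__main__":
    run_yield = 3e9  # assumed total bases from one run (illustrative)
    # Approximate, assumed target sizes in base pairs.
    targets = {
        "whole genome": 3e9,
        "whole exome": 5e7,
        "gene panel": 1e6,
    }
    for name, size in targets.items():
        # Enrichment is imperfect, so only part of the yield lands on target;
        # 0.5 is an arbitrary illustrative value for the enriched designs.
        fraction = 1.0 if name == "whole genome" else 0.5
        print(name, mean_depth(run_yield, size, fraction))
```

Under these assumptions the same run that covers a whole genome about once covers a
1 Mb panel more than a thousand times, which is why panels can be multiplexed over many
samples at lower cost per sample.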


Metagenomics 


The use of NGS has had a significant impact on the study of microbial diversity in
environmental and clinical samples. In practice, genomic DNA is extracted from a sample
of interest, converted to an NGS library, and sequenced. The output sequence is
aligned to known reference sequences for microorganisms predicted to be present in 
the sample. Closely related species can be identified, and distantly related species can 
be deduced. Furthermore, de novo data set assembly can yield information to support 
the presence of known and potentially new species. Examples of metagenomic studies 
include the analysis of microbial populations in the ocean (Huber et al., 2007) and soil 
(Urich et al., 2008). 


The impact of NGS on forensic genetics 


The idea of sequencing every DNA (and/or RNA) molecule in the sample appeals to a 
forensic geneticist, who is used to dealing with the difficulty of obtaining sufficient in- 
formation from trace samples, which frequently contain DNA from more than one 
contributor (Borsting & Morling, 2015). Today, the core forensic markers are typed 
with PCR-CE, and there are individual assays for autosomal STRs, Y-chromosome 
STRs, X-chromosome STRs, indels, mtDNA SNPs, autosomal SNPs, Y- 
chromosome SNPs, ancestry informative markers (AIMs), phenotypical markers, 
mRNA, etc. (de Knijff, 2019). One of the major advantages of NGS is that all (or 
most) of the PCR-CE assays can be combined into a single NGS assay if a capture for 
the relevant loci can be developed, such as in Precision ID GlobalFiler (Precision ID 
NGS STR Panel v2), which is compatible with the Ion S5/S5XL NGS platform from 
Applied Biosystems. Moreover, the ForenSeq DNA Signature Prep Kit (Verogen),
formally released in 2015, has been extensively validated through a wide range of
performance tests, including robustness, reproducibility, concordance with CE, and sensitivity
of detection (Almalki et al., 2017; Churchill et al., 2016; Jager et al., 2017; Kocher et al.,
2018; Silvia et al., 2017). Additionally, the DNA Signature Prep Kit has been used on
numerous occasions to process challenging samples in order to assess its applicability
to genuine forensic cases (Almohammed et al.,
2017; Kocher et al., 2018; Ma et al., 2016). In these tests, the kit outperformed CE
on formalin-fixed paraffin-embedded tissue and bone samples dating from the 7th to
18th centuries, detecting a greater number of informative markers in seven out of 10 cases
(Churchill et al., 2016). The DNA Signature Prep Kit allows the most discriminating
markers to be detected from limited samples, despite the level of DNA degradation
(Khubrani et al., 2019; Votrubova et al., 2017). The increased discriminating power
provided by combining STRs and SNPs will be useful in demanding cases such as mixture
analysis, complex kinship cases, and partial profiles resulting from degradation (Khubrani et al.,
2019; King et al., 2018).

NGS can be used to screen for variants in genes involved in the metabolism of specific
drugs as well as to supplement toxicology investigations of a deceased person to
determine whether an unexpected death was accidental or premeditated (Visser et al.,
2005). It will also be possible to examine DNA from bacteria, viruses, phages, and fungi
from the deceased in order to identify disease-causing microorganisms or to look for
microbial community imbalances that may provide clues to the cause of death (Cox et al.,
2013). The sequencing of the microbiome in swabs or soil samples revealed significant
differences in the taxa found in different locations (Giampaoli et al., 2014). This could
be used to assess similarities between trace and reference samples. However, it should
be noted that perfect matches, if not exclusions, are unlikely because the microbiome is
constantly changing due to environmental factors such as temperature, humidity, and
sampling time. It was also found that samples taken a few meters apart
at the same time shared only 50% of the microbiome diversity. Nevertheless, the
variation between sampling sites was much higher (Young et al., 2014).
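
One simple way to express how much of the detected diversity two samples share is the
fraction of taxa found in both out of all taxa found in either (the Jaccard index). The
Python sketch below shows the calculation. It is offered only as an illustration: the
cited studies may have used different similarity measures, and the taxon labels are
placeholders.

```python
def shared_fraction(taxa_a, taxa_b):
    """Jaccard index: taxa present in both samples / taxa present in either."""
    set_a, set_b = set(taxa_a), set(taxa_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


if __name__ == "__main__":
    # Placeholder taxon labels for a trace sample and a reference sample.
    trace_sample = {"taxon_1", "taxon_2", "taxon_3"}
    reference_sample = {"taxon_2", "taxon_3", "taxon_4"}
    print(shared_fraction(trace_sample, reference_sample))
```

A presence/absence measure such as this ignores relative abundance, so it is best
read as a rough first comparison between a trace and a reference sample.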


Short tandem repeats sequencing 


Because of the large national DNA databases with STR profiles from criminal offenders 
and irreplaceable trace samples from old cases, STRs are essential to criminal casework 
and will continue to be so. As a result, any NGS assay intended for forensic genetics 
must be capable of sequencing the core STR loci. However, most NGS studies focus 
on SNPs, small indels, and copy number variations, while repeats have received little 
attention, despite the fact that repeats cover nearly half of the human genome and 
STRs alone account for 15% (Treangen & Salzberg, 2012). The discovery of numerous 
new STR and SNP-STR alleles with similar sizes renders the old PCR-CE-based STR
allele nomenclature obsolete. In the literature, the nomenclature of Gelardi et al. (2014)
has been followed, which divides the name into four elements: (1) the locus name, (2)
the length of the repeat region divided by the length of the repeat unit, (3) the
sequence(s) of the repeat unit(s) followed by the number of repeats, and (4) variations in
the flanking regions. Because NGS may identify more alleles, the implementation of
NGS-based assays in forensic genetics necessitates a complete reevaluation of the current 
STR frequency databases. Another intriguing finding from the same study (Gelardi et al., 
2014) was that approximately 30% of homozygous genotype calls made by PCR-CE 
turned out to be heterozygous when the individuals were sequenced. This demonstrates 
yet another significant advantage of STR-NGS. If the contributors have alleles of the
same size with different sequence compositions or if the true allele of the minor contrib- 
utor has a different sequence than the stutter artifact of the major contributor, sequencing 
complex and compound STRs with many alleles of the same size may simplify mixture 
interpretation. It was recently demonstrated that NGS could detect minor contributor 
sequences in 1:100 and 1:50 mixtures (Fordyce et al., 2015), something that is impossible 
with the current PCR-CE technology. The reads from the minor contributor will be 
difficult to separate from stutters and noise sequences in these types of mixtures, but 
the fact that they can be identified opens up new possibilities in mixture interpretation, 
and it is certainly something that should be investigated further (Borsting & Morling, 
2015). 
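
The difference between a length-based (CE-style) allele call and a sequence-based
description can be shown with a short Python sketch. The two repeat regions below are
hypothetical, not real alleles of any named locus, and the helper names are invented
for the example. The sketch assumes a fixed reading frame, whole repeat units only (no
partial repeats), and it covers only the repeat-count and repeat-sequence elements of a
sequence-based name, not the locus name or flanking-region variation.

```python
from itertools import groupby


def repeat_blocks(repeat_region, unit_length=4):
    """Split a repeat region into consecutive (repeat unit, count) blocks."""
    if len(repeat_region) % unit_length != 0:
        raise ValueError("repeat region is not a whole number of repeat units")
    units = [repeat_region[i:i + unit_length]
             for i in range(0, len(repeat_region), unit_length)]
    return [(unit, len(list(group))) for unit, group in groupby(units)]


def bracket_notation(blocks):
    """Render blocks as a bracketed description, e.g. '[TCTA]4 [TCTG]6'."""
    return " ".join(f"[{unit}]{count}" for unit, count in blocks)


def length_allele(blocks):
    """Length-based allele: repeat-region length divided by unit length."""
    return sum(count for _, count in blocks)


if __name__ == "__main__":
    # Two hypothetical alleles of identical length but different sequence.
    allele_a = "TCTA" * 4 + "TCTG" * 6
    allele_b = "TCTA" * 5 + "TCTG" * 5
    for allele in (allele_a, allele_b):
        blocks = repeat_blocks(allele)
        print(length_allele(blocks), bracket_notation(blocks))
```

Both hypothetical alleles would be reported as allele 10 by fragment sizing, while
their bracketed descriptions differ, which is the situation in which an apparent
homozygote turns out to be heterozygous on sequencing.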


The initial commercial NGS kits for forensic genetics 


In 2014, Thermo Fisher Scientific released two SNP typing assays for the Ion PGM
System: (1) the HID-Ion AmpliSeq Identity Panel for human identification (Borsting et al.,
2014), which amplifies 124 autosomal SNPs, including the majority of the SNPforID
(Sanchez et al., 2006) and individual identification SNPs (IISNPs) (Pakstis et al.,
2010), and 34 Y-chromosome SNPs; and (2) the HID-Ion AmpliSeq Ancestry
Panel for ancestry estimation, which includes the majority of the AIMs from the Seldin
(Nassir et al., 2009) and Kidd laboratory selection panels (Nievergelt et al., 2013).
Thermo Fisher Scientific’s strategy at the time was to develop assays that would be 
used in addition to PCR-CE typing. Illumina, on the other hand, had stated that their 
strategy would be to replace PCR-CE with PCR-NGS. The ForenSeq DNA Signature 
Prep Kit was developed in the fall of 2014 to amplify 27 autosomal STRs, 8 X-STRs, 25 
Y-STRs, 95 autosomal human identification SNPs, 56 autosomal AIMs, and 24 auto- 
somal SNPs associated with pigmentary traits in a single multiplex PCR. The multiplex 
includes, among others, all of the STR loci in the CODIS and European standard sets, 
most of the SNPforID (Sanchez et al., 2006) and IISNPs (Pakstis et al., 2010) and all 
of the HIrisPlex loci (Walsh et al., 2013). The ForenSeq DNA Signature Prep Kit was 
introduced together with the MiSeq FGx platform, a MiSeq developed specifically for 
forensic genomics. One of the most difficult challenges will be creating a forensic 
NGS tool for analyzing and reporting sequence data. With NGS data, it is not possible 
to manually analyze the sequences or even each genotype call. As a result, before they 
can be used in criminal cases, software solutions must be completely trustworthy and 
thoroughly validated (Borsting & Morling, 2015). Software modules can be used to
infer biogeographic ancestry, mtDNA haplogroups, Y-chromosome haplogroups,
tissue type, and phenotypes, among other things.

For the forensic community, commercially available MPS kits with PCR-based pro- 
cedures targeting several loci in sizable multiplexes have been created, as discussed earlier. 
These tests, however, might not be appropriate for forensic evidence with degraded
DNA, like skeletonized remains and hair shafts. DNA damage from postmortem envi- 
ronmental exposures is known to occur in the form of base alteration such as depurina- 
tion and cytosine deamination as well as fragmentation (Hofreiter et al., 2001). It has been 
demonstrated that DNA from chemically and age-treated bone samples is comparable to 
ancient DNA, with average lengths as low as 70 bp (Marshall et al., 2017). 

Because of this degradation, commercial forensic MPS tests, which typically
require >100 bp fragments, may not be able to successfully enrich the sample.
Fortunately, hybridization capture and MPS can be used to assess these and other damaged
DNA samples with short fragments. In the hybridization process, single-stranded DNA or
RNA probes anneal to denatured DNA strands that are
complementary to the probe sequence. These hybridized DNA strands are then
collected while the unhybridized DNA is washed away. DNA fragments as short as
30–35 bp in length are amenable to hybridization capture, making it possible to
successfully profile DNA from even the most damaged forensic DNA samples (Gorden
et al., 2021).


Current status of next generation sequencing 


Few of the MPS-based results from forensic investigations have been presented in court. 
This shows that MPS is still not frequently applied in forensics (de Knijff, 2019). This is in
stark contrast to other fields of genetic diagnostics like oncogenetics and clinical
genetics, where tens of thousands of DNA samples have been regularly tested each year
since 2010, utilizing either targeted genome sequencing or whole genome sequencing
with MPS. There are, of course, a variety of explanatory factors for this. First, these other
genetic fields have been using SNP array platforms to screen DNA samples for genetic
variants at the entire genome level for at least 20 years, analyzing their findings with
bioinformatics techniques, without ever being constrained
by the availability of usually minute amounts of degraded DNA (LaFramboise, 2009).
It was a relatively short
step from there for these laboratories to transition into the more complicated MPS
era. They could quickly profit from the already established and authoritative Human
Genome Variation Society nomenclature criteria, and they were also quick to act in terms
of ethical norms (den Dunnen et al., 2016). The disparity in the measured outcomes of
the two technologies is the most evident and, in many ways, also the most significant 
distinction between MPS results and CE results. For the purpose of STR genotyping, 
CE converts machine-measured DNA molecule migration times into DNA fragment 
lengths. To make understanding easier, these fragment lengths are then represented in 
peak profiles and tables as a relatively straightforward string of numbers. CE does not
disclose the underlying base pair variation of the examined DNA sample, and this has a
major consequence: CE analysis of STRs (Kimpton et al., 1993) underestimates the genetic
variation that exists in the DNA sample at the sequence level. This phenomenon, in which
DNA fragments of identical size but different sequence content cannot be told apart by
CE, is commonly known as homoplasy. With MPS, the final experimental result is expressed
as DNA sequences that expose all underlying sequence variation in the targeted DNA
sample, regardless of the underlying sequencing technique. If
one wants to compare MPS STR results with CE STR data, one can translate these
DNA sequences, in the case of STRs, into DNA fragment lengths, but this is not strictly
necessary. Homoplasy is no longer a concern; genuine alleles and stutter alleles can
be clearly recognized in DNA samples from a single source. Unfortunately, stutter alleles 
that cannot be separated from real minor contributor alleles frequently make it difficult to 
analyze unbalanced mixtures with low minor contributions. The incorrect base pair sub- 
stitutions, primarily caused by DNA editing mistakes made during PCR, are likewise not 
identified by CE analysis of STRs since they have no effect on the underlying fragment 
lengths. In this way, MPS exposes the whole range of errors: (a) DNA slippage during 
PCR causes stutters; (b) DNA editing faults during PCR create base pair errors; 
(c) strand slippage (mostly at homopolymer stretches) during sequencing causes strand er- 
rors; and (d) substitution type miscalls during sequencing result in base pair errors 
(Schirmer et al., 2015). There is enough data to conclude that the latter two sources
of error are specific to the sequencing platform. The output of STR CE analysis, by
contrast, is converted into a very basic data file that essentially provides only the names of the
STR loci, the length(s) of the STR alleles, and the fluorescence intensities (or peak 
heights) identified by the CE platform. Peak profiles, which have been in use for 
more than 20 years and are simple to explain but not necessarily easy to grasp, can be 
used to display these data (Kircher et al., 2009). 
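
The difference between a length-based and a sequence-based readout can be made concrete
with a small sketch. The Python snippet below is only an illustration under stated
assumptions: the three reads are invented toy sequences for a hypothetical locus, not real
STR alleles, and the helper name group_by_length is ours. It shows how two reads of the
same length collapse into one allele when only fragment length is recorded, while a
sequence-level readout keeps them apart.

```python
from collections import defaultdict

# Toy reads for one hypothetical STR locus (illustrative sequences, not real alleles).
reads = [
    "TCTATCTATCTATCTA",  # four TCTA repeats, 16 bp
    "TCTATCTGTCTATCTA",  # also 16 bp, but with one internal substitution
    "TCTATCTATCTA",      # three TCTA repeats, 12 bp
]


def group_by_length(sequences):
    """Collect the distinct sequences observed at each fragment length."""
    groups = defaultdict(set)
    for seq in sequences:
        groups[len(seq)].add(seq)
    return dict(groups)


groups = group_by_length(reads)

# A CE-style readout records fragment lengths only.
ce_alleles = sorted(groups)

# An MPS-style readout records every distinct sequence.
mps_alleles = sorted(set(reads))

# Lengths that hide more than one sequence are the homoplasic ones.
homoplasic_lengths = sorted(n for n, seqs in groups.items() if len(seqs) > 1)

print("CE-style alleles (bp):", ce_alleles)
print("MPS-style alleles:", mps_alleles)
print("Lengths hiding sequence variation:", homoplasic_lengths)
```

In this toy data set the length-based view reports two alleles, whereas the sequence-based
view reports three, because the two 16-bp reads differ at a single base.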

Two very straightforward file formats, FASTQ (Cock et al., 2009) and/or FASTA 
(Pearson & Lipman, 1988), each of which can hold all individual DNA sequence reads 
generated during the MPS analysis, can be used to store the results of an MPS experi- 
ment. Nevertheless, because MPS platforms generate between a few tens of millions 
and a few hundred million sequence reads in a single experiment, one must rely on
purpose-built software that converts these millions of reads into an experimental
summary that one can comprehend and communicate. Because these DNA sequence files
contain all reads produced by the platform, i.e., those representing true alleles and those
containing any kind of PCR or sequencing error, the MPS software packages used to
interpret, summarize, and visualize the sequence data must be able to distinguish between
the two, ideally in such a way that one can always go back to the original sequence data
in retrospect and explore them in alternative ways if requested. With MPS,
there is no longer a gold standard in terms of the platform and software, in contrast to the 
CE analysis of STRs. As a result, it is more challenging to compare MPS results directly 
across platforms, laboratories, and forensic DNA experts. Also, and probably more 
importantly, new nomenclature standards that, ideally, have maximum clarity are 
required in order to convert MPS STR results into a format that can be compared with
CE STR results. In short, MPS data represent the whole range of potential DNA 
sequence variation, whereas CE results only tell us about DNA fragment length varia- 
tion when used to investigate the STR variation in a crime scene sample. It may be 
tempting to convert intricate MPS STR results into something akin to CE STR results, 
but doing so would mean ignoring all further genetic variation data that would be essen- 
tial for a criminal investigation (Hoogenboom et al., 2017). There may be legal restric- 
tions in many countries that could prevent the reporting scientists from including all 
MPS results of a forensic investigation in the form of, for example, a FASTA file or 
the full description of all the sequence variations because, with MPS, one obtains the
full sequence structure of the DNA fragments that were targeted. These genetic records 
are undoubtedly available and have been added to the overall inquiry file, but both the 
prosecution and the defense must make separate requests to obtain these supporting re- 
cords. When it comes to CE, this extra data is typically presented as peak profiles, elec- 
tropherograms, or tables that list all the underlying STR data. As previously said, this is a 
little more complicated for MPS. There are also legal limitations on the precise locus 
types that can be used (and reported) in forensic DNA investigations (de Knijff, 
2019). The underlying data for the MPS experiment are recorded only digitally, in FASTQ
and FASTA files, and are made available only upon specific request; they include all error
reads, an error read summary, and other pertinent data on the quality of the experiment.
For presenting MPS-STR results in court, all data can be processed with FDSTools into
HTML-format files that may be viewed and debated with the aid of a graphical user
interface in any browser (Hoogenboom et al., 2017).
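
To give a feel for what "converting millions of reads into a summary" involves at its
simplest, the following Python sketch reads a FASTQ stream and counts how often each
distinct sequence occurs. It is a minimal illustration, not a description of how FDSTools
or any other forensic package works. It assumes the plain four-line FASTQ layout (no
wrapped sequence lines), and the read names and sequences in EXAMPLE are invented.

```python
import io
from collections import Counter


def read_fastq(handle):
    """Yield (identifier, sequence, quality) from a four-line-per-record FASTQ stream.

    Reading stops at end of file or at the first empty line.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is skipped
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], seq, qual


def tally_sequences(handle):
    """Count how many reads carry each distinct sequence."""
    return Counter(seq for _, seq, _ in read_fastq(handle))


# Invented example: three identical reads and one read with a single substitution.
EXAMPLE = (
    "@read1\nTCTATCTA\n+\nIIIIIIII\n"
    "@read2\nTCTATCTA\n+\nIIIIIIII\n"
    "@read3\nTCTGTCTA\n+\nIIIIIIII\n"
    "@read4\nTCTATCTA\n+\nIIIIIIII\n"
)

counts = tally_sequences(io.StringIO(EXAMPLE))
for sequence, n_reads in counts.most_common():
    print(sequence, n_reads)
```

Because the tally keeps every distinct sequence, including the low-count one that may be
an error read, the original data remain available for later re-examination, which is the
property the text asks of MPS software.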

There is a lot more information that needs to be kept in addition to the genetic data,
whether it be error reads or reads reflecting real alleles. It is common practice to
combine many DNA samples for sequencing because a single run on an MPS machine
generates millions of sequence reads. How many samples can be pooled depends largely on
how many sequence reads are needed for each sample and each allele. A flexible
case-by-case and/or sample-by-sample experimental strategy is now more viable as a result. It
is quite easy to store the outcomes of STR CE genotyping in any STR profile database. 
The three default parameters that can be entered as extremely basic and condensed text 
strings are the sample code, locus name, and genotyping result. Individual peak heights 
and the multiplex STR kit used to create the profile are two examples of additional data 
that may be beneficial to include. Companies that sell multiplex STR genotyping kits 
take into account these guidelines because there is a universally recognized, consistent, 
standardized allele-calling nomenclature for CE. There are absolutely no standards for 
MPS. MPS can be executed on a variety of systems. While there have been a few recent 
initiatives, there are not yet any universal nomenclature rules (Phillips et al., 2018). Unless 
one chooses to utilize an allele coding system like that used for the HLA system since
1968, it would still be exceedingly challenging to condense the entire range of sequence
variation of every MPS-discovered STR allele into a short and easy text string
(“Nomenclature for Factors of the HL-A System,” 1968). The primary benefit of an allele coding
system like the one used with HLA is that the result of the STR sequence can be recorded 
as a relatively short allele designator. 
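
The idea of a short allele designator backed by a lookup table can be sketched in a few
lines of Python. Everything specific here is an assumption made for illustration: the
class name AlleleRegistry, the locus name LOCUS1, the designator layout
"locus*repeat-count:variant" (loosely modeled on the look of HLA allele names), and the
toy sequences. No existing forensic nomenclature is being reproduced.

```python
class AlleleRegistry:
    """Assign a short, stable designator to every distinct allele sequence.

    Hypothetical scheme: '<locus>*<repeat count>:<two-digit variant index>'.
    """

    def __init__(self):
        self._by_sequence = {}   # (locus, sequence) -> designator
        self._by_name = {}       # designator -> sequence
        self._next_index = {}    # (locus, repeat_count) -> next free variant index

    def designate(self, locus, repeat_count, sequence):
        """Return the designator for a sequence, creating one if it is new."""
        key = (locus, sequence)
        if key in self._by_sequence:
            return self._by_sequence[key]
        index = self._next_index.get((locus, repeat_count), 1)
        self._next_index[(locus, repeat_count)] = index + 1
        name = f"{locus}*{repeat_count}:{index:02d}"
        self._by_sequence[key] = name
        self._by_name[name] = sequence
        return name

    def sequence_of(self, name):
        """Recover the full sequence behind a designator."""
        return self._by_name[name]


registry = AlleleRegistry()
first = registry.designate("LOCUS1", 4, "TCTATCTATCTATCTA")
second = registry.designate("LOCUS1", 4, "TCTATCTGTCTATCTA")
again = registry.designate("LOCUS1", 4, "TCTATCTATCTATCTA")
print(first, second, again)
```

The short string can be stored in a profile database, but, as with HLA, the full sequence
can only be recovered by referring back to the registry.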


Recommendations 


Introducing MPS into routine forensic casework will often entail a comprehensive set of
recommendations or rules that offer standards for every conceivable technological,
interpretative, and reporting difficulty. A few practical problems also need to be resolved,
such as how to make the various national DNA databases accept STR alleles identified by
MPS that include the complete range of genetic diversity found. Because of the large
variety of MPS platforms and software that produce MPS outcomes, providing such
recommendations will be more difficult than it was for CE, since they must cover a much
wider range of difficulties. These recommendations should address at least the following
concerns, in no particular order (de Knijff, 2019):

1. A standardized nomenclature for MPS-based STR alleles that enables understanding/
reconstructing the entire genetic variation spectrum without the necessity for back 
referencing, as in the case of the HLA nomenclature system. 

2. Recommendations for the minimum number of reads needed to accurately identify an
STR allele in different scenarios, such as a straightforward reference database sample,
a single-source crime scene sample, or a mixed crime scene sample.

3. Recommendations on how to report the whole range of nontarget (or incorrect) reads.
This can also involve disclosing details regarding the sample pooling method that was
applied and documenting the barcodes that were used.

4. Recommendations on reporting the MPS approach that was employed: Were reads 1 and 2
sequenced in full, forward and reverse, before being assembled and aligned, or was a
less inclusive sequencing technique employed? Depending on the MPS platform, was the
PCR amplicon sequenced in its entirety, or were partially sequenced amplicons
assembled and aligned?

5. Suggestions for the file formats required to hold all MPS results. 

6. Minimum requirements for the software used to analyze and summarize MPS results.
What details should be available right away?

7. Statistical software programs that are used to assess the evidentiary value of matches
between STR profiles, and that have so far been CE-based, must be adapted to the novel
allele designations provided by MPS experiments.
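
The read-count and pooling questions raised in items 2 and 3 come down to simple
arithmetic once the thresholds are fixed. The Python sketch below is a hedged
back-of-the-envelope calculation, not a validated guideline: the function name
max_pooled_samples, the assumption of two alleles per locus, and every number in the
example (run size, number of loci, reads per allele, usable fraction) are hypothetical.

```python
def max_pooled_samples(run_reads, loci, reads_per_allele,
                       alleles_per_locus=2, usable_fraction=1.0):
    """Estimate how many samples fit into one sequencing run.

    run_reads          total reads produced by the run
    loci               number of loci typed per sample
    reads_per_allele   minimum reads wanted for each allele
    alleles_per_locus  assumed alleles per locus (2 for a single-source sample)
    usable_fraction    share of reads that are on target and pass quality filters
    """
    reads_per_sample = loci * alleles_per_locus * reads_per_allele
    return int(run_reads * usable_fraction // reads_per_sample)


# Hypothetical run: 10 million reads, 50 loci, 500 reads wanted per allele.
ideal = max_pooled_samples(10_000_000, loci=50, reads_per_allele=500)

# Same run if only three quarters of the reads are usable.
realistic = max_pooled_samples(10_000_000, loci=50, reads_per_allele=500,
                               usable_fraction=0.75)

print(ideal, realistic)
```

Raising the required reads per allele, as one might for a mixed crime scene sample,
lowers the number of samples that can share a run in direct proportion.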


Concluding remarks


High-throughput sequencing has accelerated research in a wide range of biological and 
applied science fields. The use of NGS in forensic genetics has been debated in recent 
years, and we are now seeing applications aimed specifically at human identification 
and phenotypic trait determination. The cost of instruments and kits will determine 
how quickly the transition from CE to NGS takes place, as well as the proportion of cases 
investigated by NGS. Other critical aspects of incorporating NGS into forensic genetics 
include the development and validation of software solutions. 


Introduction


Homecoming seemed to come early this year. The leaves had given way to the vibrant 
red and yellow fall colors; pep rallies were being held on campus; tailgating parties were 
popping up everywhere near the stadium; and an overall excitement seemed to fill the air 
on campus. After the game, there would be, no doubt, endless fraternity and sorority 
parties to celebrate the hometown football team’s victory. Alpha Epsilon Pi, a major fra- 
ternity on campus, was hosting a party after the game, and everyone was expected to 
attend. Slightly past midnight, a female at the party, who was known around campus 
for her sexual promiscuity, started to solicit male companions presumably for sexual fa- 
vors. After a short stint with one of the fraternity brothers, the female returned to the 
party and continued to consume alcohol. Several drinks later, she met another male 
and proceeded to return to the bedroom where she had been previously with the first 
male companion. However, by this time the female was quite inebriated, and after a short 
bit of “tossing” in the bed with the second male companion she passed out. Upon awak- 
ening in the early morning, the female noticed that her underwear was moist, that her 
clothing was disheveled, and that almost everyone, including the two males, had left 
the party. Not being able to recall what had happened with the two males the night 
before, and after some deliberation, the female presented to the local hospital by
midmorning and claimed that she had been raped. The sexual assault nurse examiner 
(SANE) questioned the young lady and collected evidentiary samples such as clothing 
(i.e., blouse, skirt, and underwear) and swabs from various areas of her body. The eviden- 
tiary samples, as well as a known reference sample, were sent to the state forensic labo- 
ratory for DNA analysis. Known reference samples were later collected from both of the 
male companions and also sent to the laboratory for DNA analysis. Following the 
completion of the DNA testing, the state forensic laboratory reported its findings to 
the state’s attorney assigned to the case. In essence, the state laboratory reported that 
both males were minor contributors to the DNA profiles developed from the vaginal 
swabs of the victim, but a third major DNA profile was isolated from the underwear 
that was from an unidentified male. Upon further investigation and after additional 
testing, this DNA profile was determined to be from the fraternity brother whose bed 
and room were used by the female and two male companions on the night of the party.
This third male was ultimately arrested and charged with first-degree rape. 

The above case scenario is not atypical of sexual assault charges. In this instance, the
victim does not recall most of the events that occurred the night before, except that more
than one male was involved and that sex, whether consensual or forced, probably took
place. However, the male who was ultimately arrested and charged with rape was not
involved in the sexual assault; his DNA incriminated him simply because it had
transferred from his bed to the victim’s underwear. In any such situation, victims should
report the assault to local authorities in an
effort to identify the perpetrator(s) and find justice. This chapter will explore how
such events/scenarios can occur and how forensic analysts need to be aware of such sit- 
uations when the deposition of DNA is in question, especially in light of the sensitivity of 
today’s technology and instrumentation and the ability to detect DNA from multiple 
sources. 


Touch DNA (or trace DNA) 


In “An Introduction to Forensic DNA Analysis,” Rudin and Inman (2002) define
contamination “as the inadvertent addition of an individual’s physiological material or DNA during
or after collection of the sample as evidence ... a contaminated sample is one in which the
material was deposited during collection, preservation, handling, or analysis.” When the
addition of biological material is deliberately made to a sample, this constitutes evidence 
tampering. The unintentional addition of biological material during sample collection 
and handling constitutes contamination that could have a significant impact on the inter- 
pretation of the results and courtroom proceedings. 

Forensic evidence needs to be free of foreign material, such as external DNA or other 
trace biological material for the testing process to present a true and accurate representa- 
tion of the evidence and crime scene. Consequently, crime scene investigators and ana- 
lysts that handle evidence are required to adhere to stringent protocols that require 
wearing protective clothing (e.g., gloves, masks, etc.) to avoid contaminating the evidence
with shed hair and skin cells. Such protective “gear” protects the investigator/analyst 
from the sample (samples should always be handled as potentially infectious agents) 
and the sample from the handler (shed hair and sloughing of skin cells will contaminate 
the sample). Since DNA test methods are sensitive enough to generate a full DNA profile 
from as few as one to five shed skin or epithelial cells, great care in the handling of eviden- 
tiary samples is critical and must be maintained at all times. 

The human body is constantly shedding dead skin cells; between 30,000 and 40,000
skin cells are shed every hour (Milestone, 2004). Thus, over a 24-h period, an individual
may lose roughly 720,000 to 960,000 skin cells, or almost a million (Sender et al., 2016).
Such shed epithelial cells may
have nuclei (containing 23 pairs of chromosomes available for DNA testing) that are 
visible in the cytoplasm of the cell or may have lost the cellular contents due to leakage or
apoptosis (programmed cell death). In addition, several studies have demonstrated the 
presence of cell-free DNA from senescent cells, which can be used for DNA profiling 
(Anker et al., 2001; Botezatu et al., 2000; Garcia-Olmo et al., 2004). The nucleated cells, 
however, will contain DNA that can be useful for typing and linking an individual to a 
crime scene. In addition to the “normal” shedding process, the transfer of biological ma- 
terial from one object to another is influenced by both environmental (e.g., friction and 
pressure, and length of handling time) and genetic (e.g., “shedder” status) factors. These 
factors play a significant role in establishing the type and number of cells deposited on an 
individual (i.e., skin surface or clothing) or an inanimate object. 

Since the introduction of DNA profiling, the analysis of DNA extracted from biolog- 
ical material has become a daily occurrence in forensic laboratories worldwide. Over the 
years, the technology to extract and analyze DNA has become very sophisticated. One
such application, known as “touch DNA” or “contact trace DNA,” refers to the DNA
that is recovered from skin cells that are left behind when a person touches or comes 
into contact with items such as clothes, a weapon, or other objects. These epithelial cells 
can be collected using lifting tape, swabbed with a Q-tip, or even scraped from the clothes 
of the victim or objects and viewed microscopically and/or subjected to DNA analysis. 


Direct and indirect transfer of biological evidence 


Primary transfer. In the early 20th century, Edmond Locard, a forensic scientist in France,
theorized that the perpetrator of a crime would bring “something” to the crime scene and 
leave with “something” from the crime scene. This concept, known as Locard’s ex- 
change principle, leads to the statement that “every contact leaves a trace” and that 
such items that are left behind or picked up can be used as forensic evidence. Locard’s 
exchange principle essentially states that when two objects come into contact with 
one another, a transfer of material can occur. For example, when an individual picks
up a pen, epithelial (skin) cells can be left on the touched item. This sort of transfer
can also occur when an individual touches or comes into contact with another person. 
The transferred item, such as skin cells or hair, could then be analyzed for DNA. This 
type of exchange between animate or inanimate objects is called primary or direct
transfer. Many other types of primary transfers have been documented. These include 
the transfer of saliva (which contains skin cells) when kissing a person, the exchange of 
skin cells during a handshake, or the transfer of skin cells on the waistband of a victim’s 
underwear during a sexual assault. Keep in mind that all of these examples of primary or 
direct transfer involve the exchange of biological material that contains DNA.

Secondary transfer. Biological evidence can also be transferred through indirect contact.
Indirect contact is when biological material or evidence is transferred to an animate or 
inanimate object and then transferred to another person or object. For example, if an 
individual handled a pen or pencil and then another person picked up that pen or pencil,
the second individual could pick up material or evidence from the first person who 
touched the object. Another example of indirect transfer is when hair or skin cells are 
left on a bed by one person but are then transferred to another (or second) person that 
comes into contact with the biological material. These examples of indirect transfers 
are also known as secondary transfers. In these examples, the DNA from the first person 
is transferred or exchanged and now can be detected on the second person (or object) that 
contacted the material or evidence. Imagine the legal consequences of such interactions. 
In essence, an individual could be identified from skin cells or DNA left on an important
piece of crime scene evidence and accused of the crime without ever having been present
at the scene.

Tertiary transfer. A further type of transfer is called tertiary transfer. For example, the
perpetrator of a rape leaves semen on the victim’s bedsheet (direct or primary transfer).
Then, the victim touches or comes into contact with semen on the bedsheet (indirect 
or secondary transfer). Finally, the victim wipes their hands or their body with a wash- 
cloth, and some of the semen transfers from their hands/body to the washcloth. Again, 
because of secondary and now tertiary transfer, biological evidence can be found in a 
location where the biological source of the evidence (i.e., the individual) was never 
present. 

As demonstrated, biological material such as skin cells, semen, blood, saliva, and urine
can be transferred from one object to another and, once analyzed, lead to the arrest and
conviction of the perpetrator. However, such a transfer of biological evidence could also
lead to wrongful convictions. As an example, in June 2012, Lukis Anderson, a homeless man,
was charged with capital murder for the death of Raveesh Kumra, a multimillionaire
from Silicon Valley. This charge was based on Anderson’s DNA found at the
crime scene. However, at the time of the murder, Anderson was hospitalized due to 
intoxication and in a nearly comatose condition, requiring medical attention and
supervision. The question was, how did Anderson’s DNA end up at the crime scene? During
the investigation, Anderson’s legal team learned that the paramedics who had treated
Anderson had subsequently responded to the Kumra murder scene and had inadvertently
transferred Anderson’s DNA to the scene (Smith, 2016). This case of secondary transfer is
an example of the errors or risks that may occur when relying on DNA evidence.
Moreover, such “misplaced” DNA evidence could easily lead to the wrongful conviction
of an innocent person. An illustration of the primary, secondary, and tertiary transfer
of biological material is shown in Fig. 2.1.

Effects of surface/substrate “material” and biological transfer. As demonstrated above, the 
transfer of biological evidence from an individual to an item or another person at a crime 
scene can happen in several ways. The amount of biological evidence or DNA transferred
to a substrate depends on many factors, among which the nature of the substrate (or
object) is critical. Knowing how readily different substrates yield DNA would allow
analysts to decide the “order” in which to analyze evidence,
saving time and other valuable resources, especially in cases where a large number of
evidentiary samples have been submitted for analysis. Daly et al. (2012) performed a study
where three different substrates were selected to determine the amount of DNA that was 
transferred. The items or substrate materials in this study consisted of glass, fabric, and 
wood. These items were selected to represent objects such as bottles, firearms, clothing 
from a crime scene or sexual assault victim, and handles on knives. Volunteers handled 
the various items for 60 s; the DNA was recovered, extracted, quantified, and amplified, and
the profiles were analyzed and compared. Success rates for obtaining adequate DNA
profiles of the handlers were approximately 36% for wood, 23% for fabric, and 9% for glass.
These results suggest that the order of sampling or testing by the analysts in the laboratory 
should be surfaces made of wood, then clothing, and finally glass since substrates made of 
glass will have a low expectation of obtaining a meaningful DNA profile. 
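
The triage logic suggested by these figures can be written down in a few lines. The
Python sketch below simply sorts exhibits by the approximate success rate of their
substrate as quoted above from Daly et al. (2012). The exhibit list and the function name
triage are invented for illustration, and a real laboratory would weigh many other
factors (case context, sample condition, shedder status) that this toy ordering ignores.

```python
# Approximate profiling success rates quoted above from Daly et al. (2012).
SUCCESS_RATE = {"wood": 0.36, "fabric": 0.23, "glass": 0.09}


def triage(items):
    """Sort (exhibit, substrate) pairs so the most promising substrates come first."""
    return sorted(items, key=lambda pair: SUCCESS_RATE[pair[1]], reverse=True)


# Hypothetical exhibits submitted in one case.
exhibits = [("bottle", "glass"), ("shirt", "fabric"), ("knife handle", "wood")]

ordered = triage(exhibits)
for exhibit, substrate in ordered:
    print(f"{exhibit} ({substrate}): about {SUCCESS_RATE[substrate]:.0%} success rate")
```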

In another study, Fonneløp et al. (2015) demonstrated the transfer of DNA using
various substrates. One part of this study investigated the primary transfer of touch
DNA from plastic, wood, and metal surfaces. The results from this study demonstrated
that DNA was readily transferred to wood and plastic, but less DNA was transferred to
a metal substrate such as a door handle. These data support the earlier findings of Daly
et al. that more DNA is transferred to wood than to glass or, in this instance, metal.
The rough surface of the wood appeared to facilitate the transfer of cells, and thus DNA,
through abrasive action. In addition, Fonneløp et al. were able to generate a full DNA profile
after three transfers. Goray et al. (2010) also showed that the transfer of DNA was
significantly greater when deposited on a porous substrate such as cotton than when deposited
on a smooth, hard surface such as plastic. Lehmann et al. (2013) conducted a study in which
touch DNA was deposited on cotton or glass and then transferred in series
onto “new” substrates, with six transfers in each series. A full DNA profile was observed
on the first substrate, whereas only a partial profile was observed on the second to the fifth
glass substrate. This study emphasized that great care and caution are needed when
handling and examining evidentiary samples to prevent the transfer of biological material.


Effects on transfer and shedder status (assessment of shedder status 
and implications for secondary DNA transfer) 


History of detecting biological transfer. In 1996, Kafarowski et al. initiated a study to evaluate 
the likelihood of the transfer of spermatozoa during laundering. At that time, there were
few, if any, reports on the likelihood of spermatozoal transfer among laundered
clothing items. Three couples participated in this study, whereby a semen stain was 
deposited on a clean pair of cotton panties by natural drainage following vaginal inter- 
course. The stains on the panties were air dried, marked or outlined for identification 
later, individually washed with three “pristine” pairs of cotton or cotton blend panties 
for a 10-min warm wash, followed by a cold rinse cycle with a phosphate-free detergent, 
and then machine dried. Following washing, the stains were examined for acid phospha- 
tase (AP) activity, subjected to Christmas tree staining, and examined microscopically for 
the presence of spermatozoa. In all three instances, trace amounts of spermatozoa had
transferred to the “pristine” panties during laundering. Of the 162 samples excised, 
50% contained at least one sperm head, 38% contained one to two sperm heads, and 
16% contained three to eight sperm heads. These observations revealed that spermatozoa
could be present on a garment without any sexual activity being involved, a finding that
could have strong repercussions with regard to an individual’s testimony. Moreover,
these results could be of significant importance in cases involving
victims/complainants who are not sexually active or who are unable to recall the
details/specifics of the assault. In the absence of DNA testing, as observed in this study, transfer
during laundering warrants equal consideration with the direct transfer of biological
material as a possible explanation for the presence of spermatozoa.

In 1999, Ladd et al. systematically evaluated the nature of biological transfer in an 
attempt to confirm the findings by Van Oorschot and Jones (1997) regarding the primary 
transfer of DNA from person to person or to various objects and the secondary transfer of 
DNA through an intermediary. In the first part of this study, individuals shook hands for 
various lengths of time (1, 5, 10, 30, and 60 s) and then handled precleaned items
(e.g., door handles, telephone mouthpieces, computer keyboards, computer mice, and
steering wheels) for 5 s. In the second part of the study, precleaned coffee mugs were
handled for 2 h and then handled by a second individual for 10 s. Hands and objects
were swabbed, and the DNA was extracted by organic extraction/Microcon purification
(a method not commonly used today), quantified, and amplified using the AmpliType
PM and DQA1 typing kits (kits no longer manufactured by Perkin Elmer Applied
Biosystems) according to the manufacturer’s protocol. In addition, biological transfer in this
study was evaluated by short tandem repeat (STR) analysis using the AmpFlSTR
Profiler Plus and COfiler DNA typing kits. The results showed that the primary transfer
of DNA could yield interpretable results with the AmpliType PM and DQA1 typing
systems, but detection was not guaranteed and appeared to depend on the individual
performing the handshake or handling the objects. In many instances, there was
insufficient DNA to generate an interpretable genotype, and because the smaller loci yielded
amplified product while the larger loci did not, degradation appeared to
be a factor.

The STR results were similar to those for the dot-blot-based (AmpliType PM and
DQA1) systems. Degradation and low DNA yield appeared to be significant factors in
locus and allelic dropout, with some allelic peaks barely exceeding background levels.
In fact, a complete secondary STR profile was never detected. The authors concluded
that “the primary transfer of DNA by handling is possible, but detecting an interpretable
genotype is not assured.” Moreover, secondary transfer was not observed; therefore, the
data “do not support the inference that the interpretation of DNA profiles from case
samples could be compromised by secondary transfer.” Given today’s improved
extraction methods and purification yields, together with the increased sensitivity of
instrumentation, it is no wonder that this phenomenon has stimulated
further investigation.

For all intents and purposes, when a DNA profile is collected from an object at a crime
scene, it is not possible to determine when the DNA was transferred to the object. To
make matters worse, one cannot determine whether this DNA was transferred prior
to or during the course of the crime. Building on the research of Van Oorschot and Jones
(1997) and Ladd et al. (1999), Lowe et al. (2002) demonstrated that individuals differ
in their tendency to deposit DNA during primary and secondary
transfer, under ideal and specific laboratory conditions, to another individual and
subsequently to an inanimate object. Specifically, a secondary transfer of
DNA from one individual to another and then to an object can occur under favorable
conditions. In fact, a full DNA profile from one individual was recovered from an
item that they had not touched, “while the profile of the person having contact with
the item was not observed.” Interestingly, when a 30 min or 1 h delay was incorporated
between human contact and contact with the object, both of the individuals’ DNA
profiles were detected. Interpretations of such transfers between individuals and inanimate
objects agree with the findings of van Oorschot et al. (1999) as well as Ladd et al.
(1999). However, interpretations of such results or DNA patterns may depend on the
circumstances of the case and/or crime scene under investigation as well as the events
that are described by the victim and/or suspect.

In 2012, Daly et al. examined the amount of DNA transferred from an individual to
various inert surfaces such as glass, fabric, and wood. As discussed above,
the results demonstrated that the amount of DNA transferred differed among
these surfaces and that the success rate of achieving a full DNA profile could depend on
the individual touching the object (i.e., a good vs. a bad shedder), the surface and nature
of the object, and the activities leading up to the individual touching the object.
Unfortunately, in most casework the forensic analyst will not have information
about the individual’s shedder status or the activity of the individual prior to touching or
transferring biological material to an object.

As interest increased in understanding DNA transfer and its roles in criminal investi- 
gations, Goray et al. (2012) evaluated multiple transfers of DNA using mock crime scene 
scenarios. Although this study revealed mixed results, the authors concluded that the pa- 
rameters of the experiment would have to be defined in order to align with the reported 
sequence of events. Such parameters would include, but not be limited to, type and dura- 
tion of contact, type of substrate that the items (or samples) were transferred to, and envi- 
ronmental conditions such as temperature and humidity, together with other conditions 
that should be incorporated into the experimental design. The authors concluded that 
“such reenactment only becomes relevant and possible when clear and specific secondary 
or further transfer events are proposed during criminal investigation or court proceed- 
ings.” The authors also suggest that a framework or database could be developed that 
comprises variables or factors previously studied that could lead to a predictive value as 
to their role in the transfer of biological material. As such, with DNA transfer
increasingly used to explain the presence of specific biological material, it is
reasonable to expect that DNA experts will present and/or argue this line of defense
as an explanation in courtroom proceedings.

STR analysis. Polymerase chain reaction (PCR)-based STR systems offer many ad- 
vantages over earlier DNA typing techniques (e.g., restriction fragment length polymor- 
phisms, or RFLP). STR systems provide a rapid and sensitive method to evaluate small 
amounts (1 ng) of human DNA. This small amount of DNA needed for STR systems is 
50 times less than what is normally required for RFLP analysis. Also, the repeating
sequences in an STR are relatively short, with the entire STR strand or allele generally
being less than 400 bp in length. This short length renders STR systems amenable to
the analysis of samples suspected of being degraded. STR analysis often allows the 
DNA analyst to recover a complete DNA profile even from an evidentiary sample 
that was exposed to unfavorable conditions (e.g., body or stains subject to extreme 
decomposition). This is in sharp contrast to RFLP systems, which required large sample 
sizes for analysis and full-length fragments, which often consisted of thousands of bases, to 
generate a complete DNA profile. 

More recent studies investigating the biological transfer of evidence have utilized 
STR analysis due to its level of sensitivity, its consistency, and the amount of variation 
demonstrated between individuals. Human genomes contain 5% to 10% repetitive sequences
that occur in tandem or adjacent to each other. These repetitive sequences,
which appear more than once on the same chromosome, vary in size and length and 
show sufficient variability among individuals in a population. Regions of DNA that 
contain these short, repeated segments, or STRs, are important markers for human identity
testing in the forensic community. There are thousands of STR markers scattered
throughout the human genome, occurring, on average, once in every 10,000
nucleotides. The DNA sequence repeated in an STR motif is usually from 2 to 7 base
pairs (bp), with 4 bases being the preferred size for forensic systems (Edwards et al.,
1991, 1992; Warne et al., 1991). An example of a 4 bp or tetranucleotide repeat is
shown below (Fig. 2.2), where the TCTA motif is repeated 4 times.
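
The repeat count that defines an STR allele can be illustrated with a short Python sketch.
It is not taken from any forensic software: the function name and the flanking bases are
assumptions chosen for illustration, and only the TCTA motif repeated 4 times comes from
the text.

```python
import re


def longest_tandem_run(sequence, motif):
    """Return the largest number of back-to-back copies of `motif` in `sequence`."""
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    runs = [len(m.group(0)) // len(motif) for m in pattern.finditer(sequence)]
    return max(runs, default=0)


# Illustrative sequence: the flanks are assumed, the core is four TCTA units.
example = "ATGTGA" + "TCTA" * 4 + "TGG"
print(longest_tandem_run(example, "TCTA"))  # 4
```

Counting whole copies of the motif, rather than measuring the region in bases, mirrors the
way STR alleles are named by their number of repeat units.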

STRs and their corresponding loci are easily amplified by PCR. Further, PCR ampli- 
fication of many different STR loci is commonly performed simultaneously in the same 
tube. The simultaneous amplification of two or more loci is commonly known as multi- 
plexing or multiplex PCR. Initially, for a multiplexing reaction to be successful, the sys- 
tem was designed to ensure that the size of the amplified products did not overlap, 
thereby allowing each STR allele for a specific locus to be clearly visualized on a gel 
or by capillary electrophoresis. However, this design requirement of non-overlapping
fragments became less important with the development of multicolor detection systems,
because loci labeled with different dyes can share a size range and still be distinguished.
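
The design constraint described above can be sketched as a simple check: two loci conflict
only if they carry the same dye and their amplicon size ranges overlap. The panel below is
entirely hypothetical (locus names, dyes, and size ranges are invented for illustration) and
does not describe any commercial kit.

```python
from itertools import combinations

# Hypothetical panel: (locus name, dye channel, min amplicon bp, max amplicon bp).
PANEL = [
    ("LocusA", "blue", 100, 150),
    ("LocusB", "blue", 160, 220),
    ("LocusC", "green", 110, 170),
    ("LocusD", "green", 150, 210),
]


def find_conflicts(panel):
    """Return pairs of loci that share a dye and have overlapping size ranges."""
    conflicts = []
    for (n1, d1, lo1, hi1), (n2, d2, lo2, hi2) in combinations(panel, 2):
        same_dye = d1 == d2
        overlap = lo1 <= hi2 and lo2 <= hi1
        if same_dye and overlap:
            conflicts.append((n1, n2))
    return conflicts


print(find_conflicts(PANEL))  # [('LocusC', 'LocusD')]
```

LocusA and LocusC overlap in size but are reported as compatible because they are read in
different dye channels, which is the point of multicolor detection.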

Different detection methods are available to visualize the STR products. The STR 
loci and corresponding alleles may be separated by gel electrophoresis and detected using 
ethidium bromide, silver staining, or fluorescent stains (e.g., SYBR Green). Several STR
systems have been developed where fluorescent dyes or labels are used to detect the
STR alleles either during (i.e., capillary electrophoresis) or after separation (i.e., gel
electrophoresis). The resulting STR profiles are routinely interpreted by direct comparison to
DNA standards, allelic ladders (an artificial mixture of the common alleles present in the
human population for a particular STR marker or locus), and reference standards (known
DNA profiles from the victim and suspect). Probability calculations are determined based
on classical population genetic principles. 
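
As a minimal sketch of such a calculation, the Python below assumes Hardy-Weinberg
proportions within each locus and independence between loci (the product rule). The text
does not specify which principles are applied in a given laboratory, so this is an
assumption, and the allele frequencies are hypothetical values, not population data.

```python
from math import prod


def genotype_frequency(p, q=None):
    """Hardy-Weinberg expectation: p^2 for a homozygote, 2pq for a heterozygote."""
    return p * p if q is None else 2 * p * q


def random_match_probability(genotypes):
    """Multiply per-locus genotype frequencies, assuming independent loci."""
    return prod(genotype_frequency(*alleles) for alleles in genotypes)


# Hypothetical three-locus profile: heterozygote, homozygote, heterozygote.
profile = [(0.10, 0.20), (0.25,), (0.05, 0.15)]
print(random_match_probability(profile))  # approximately 3.75e-05
```

Each additional independent locus multiplies the result by another small number, which is
why multi-locus STR profiles can be so discriminating.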

For STR markers to be effective across various jurisdictions, a common set of stan- 
dardized markers is used. Currently, the forensic scientific community in the United 
States has established a set of 13 core STR loci, which, in turn, can be entered into a na- 
tional database known as the Combined DNA Index System (CODIS), a collection of 
DNA profiles from known offenders. A summary of the 13 CODIS loci is contained 
in Table 2.1. 


........ATGTGA TCTA TCTA TCTA TCTA TGG........
Figure 2.2. A 4 bp (i.e., TCTA) or tetranucleotide repeat where the TCTA motif is repeated 4 times.

Table 2.1 Information on the 13 core short tandem repeat (STR) loci listed in CODIS.

STR locus    Chromosome number    Repeat motif
FGA          4                    CTTT
vWA          12                   [TCTG][TCTA]
D3S1358      3                    [TCTG][TCTA]
D21S11       21                   [TCTA][TCTG]
D8S1179      8                    TATC
D7S820       7                    GATA
D13S317      13                   TATC
D5S818       5                    AGAT
D16S539      16                   GATA
CSF1PO       5                    AGAT
TPOX         2                    AATG
TH01         11                   TCAT
D18S51       18                   AGAA


Persistence and transfer of DNA from laundered semen/biological stains. Sexual assaults
account for more than 60% of criminally related cases. In fact, as of 1998, one out of every
six American women has been the victim of an attempted or completed rape in her lifetime,
and because the United States population has increased substantially since then, it is
probable that the number of victims has as well (National Institute of Justice and Centers
for Disease Control and Prevention). In most instances, these cases include a comprehensive
case history, a sexual assault nurse examiner (SANE) report, and biological samples such as
swabs and clothing (e.g., underwear, blouses, pants, etc.) that are collected as part of the
physical evidence recovery
kit (PERK). Unfortunately, a number of sexual assault cases involve children and the 
elderly who are unable to provide an accurate and/or detailed account of the assault. 

The identification of biological fluids, especially semen or spermatozoa, begins with a
physical examination of the clothing and garments submitted in the PERK using an alternate
light source. If presumed biological stains are observed, it is imperative to determine
the biological source of the evidence. If the stain is presumed to be semen, the analyst will
test for the presence of acid phosphatase (AP) activity and/or the prostate-specific antigen
(p30) using presumptive assays. If a significant amount of sample is present, a confirmatory
test, such as the Christmas tree stain or the rapid stain identification for sperm, will be
performed.

When semen is identified on the garments or underwear of the victim, sexual activity
is usually presumed. However, the manner of deposition may not be as evident as it appears.
Situations in which only trace amounts of sperm are detected, or in which the defendant is
adamant about not having had sexual contact with the complainant/victim, can complicate the
interpretations made by the forensic analyst. In these instances, the question becomes how
the semen was deposited on the clothing. In other words, was the semen deposited as a result
of sexual activity or by an innocuous incident or source, such as the secondary transfer of
biological material during laundering? Moreover, these findings, in
particular any genetic profiles that are generated, may have an innocent explanation in
intrafamilial cases, where biological material or DNA may be deposited onto a victim’s clothing
by various means in a shared family household. This becomes even more relevant when
considering today’s sophisticated technology and instrumentation. For example, in 1996, 
Kafarowski et al. demonstrated that semen from a single pair of semen-stained panties 
could be transferred to previously unstained panties during laundering. In a later study, 
Noel et al. (2016) demonstrated that sufficient amounts of DNA could be transferred 
onto laundered clothing from bed sheets containing a varying number of ejaculates to 
yield complete genetic profiles. In addition, DNA from relatives living within the 
same household was detected in most cuttings from the children’s underwear. In another 
study, Brayley-Morris et al. (2015) demonstrated that complete DNA profiles could be 
obtained from laundered semen stains on clothing, even with an 8-month lag time between
semen deposition and laundering. All of these studies show that biological material,
such as semen, can be transferred from semen-stained clothing to pristine clothing
during laundering. However, such data could be interpreted as evidence of sexual assault
or activity, especially when found on a child’s undergarments or on the clothing of a
complainant who is unable to articulate the details of a specific event. Such information
and scenarios should help forensic biologists interpret evidence from such difficult intra- 
familial cases and allow analysts to consider whether the biological material or DNA was 
directly transferred onto the clothing in question or merely transferred by innocent 
means. 


Touch DNA and wrongful convictions 


Improvements in the efficiency and accuracy of DNA typing methods, as well as the in- 
crease in the sensitivity of the instrumentation, have created new sources of testable DNA 
evidence, including touch DNA. As stated previously, touch DNA refers to the genetic 
information recovered from epithelial (or skin) cells left behind when a person makes 
contact with another person or another object. During the commission of a crime, an 
assailant can leave touch DNA behind when s/he has applied force or pressure, which 
deposits cells on a victim’s clothing or other items implicated in the crime. Touch 
DNA testing uses the same PCR technology and STR analysis to test epithelial cells 
for DNA as samples from more traditional sources of DNA. The main difference be- 
tween traditional DNA testing of bodily fluids and touch DNA is the material from 
which the DNA is collected and extracted, not the method by which the DNA sample 
is analyzed. Owing to this series of technological advances, it is now possible to detect
DNA at levels never imagined when DNA fingerprinting started in 1985. In essence, a
crime scene investigator only needs to collect a sample with a few epithelial cells to
generate a full DNA profile. However, this heightened level of sensitivity can create
misleading results, such as false positives, or reveal the profile of an individual who was
not present at the crime scene. 

A study by Bolivar et al. (2016) showed that fingerprint brushes used at a crime scene 
could transfer DNA from one crime scene to another. This study also indicated that the 
detection of secondary transfer of DNA is enhanced using low-copy number STR typing 
procedures. This is another example where the forensic analyst will need to carefully 
consider the various scenarios as to the origin or deposition of biological material or 
DNA on evidentiary samples. 

Notable cases involving touch DNA and secondary transfer of biological evidence 
have inspired TV and movie producers to create murder mysteries based on this phenomenon
of DNA transfer. Stillwater, a 2021 American crime drama film based on the
conviction of Amanda Knox, an American college student convicted in Italy for the 
murder of her roommate, is such an example. In 2007, Knox, her boyfriend, and a third 
defendant were accused of murdering her roommate. Police suspected that Knox and her 
boyfriend held the knife against the victim while the third defendant assaulted her. To 
support this theory, police pointed to a trace amount of the boyfriend’s DNA that was 
found on the victim’s bra strap and that both Knox’s and the victim’s DNA were found 
on a knife recovered from the boyfriend’s apartment. However, the knife did not match 
the victim’s wounds. The Italian Supreme Court acquitted Knox in 2015, partly due to 
evidence contamination and a misunderstanding of how DNA transfers. With heightened
analytical sensitivity, cross-contamination and biological transfer can easily create false positives.
For example, analysts are picking up DNA transferred from one person to another by way 
of an object that both individuals have touched, or from one piece of evidence to another 
by crime scene investigators or laboratory technicians, or when two items touch each 
other in an evidence bag. Knox spent 4 years in an Italian prison because of her wrongful 
conviction and another 3 years trying to clear her name. In 2015, the Italian Supreme 
Court’s report established that the third defendant was the crime’s only perpetrator. 
Although juries and judges have overcome their skepticism in certain cases, acquittals 
often come months or years after the defendant’s life was interrupted. 


A new awareness 


Since the introduction of DNA profiling, the analysis of DNA from biological evidence 
has become a daily occurrence in forensic laboratories worldwide. Over the years, the 
technology to isolate and analyze DNA has become very sophisticated and sensitive. 
Since DNA testing is sensitive enough to generate full profiles from only a few cells, great
care in the handling of evidentiary samples is critical. In fact, crime scene investigators 
and forensic analysts in the laboratory handling evidence are required to adhere to strin- 
gent and well-established laboratory protocols to minimize and/or avoid contaminating 
the evidence. 

Direct and indirect transfer of biological evidence has been shown to occur between
individuals and inanimate objects. And because of direct and indirect transfer, biological 
material containing DNA can be isolated from a piece of evidence and the individual 
identified. However, as explored earlier, certain situations could lead to analysts’
misinterpretation of the deposition of the DNA. Such “misplaced” DNA evidence and
misinterpretations could lead to the wrongful conviction of an innocent person.

Improvements in the efficacy and accuracy of DNA typing methods have provided a
new source of evidence for DNA testing. Touch DNA is an example whereby a person can
make contact with another person or object during the commission of a crime and leave
trace amounts of skin cells or DNA. Such analyses of touch DNA have led to countless
convictions; however, this heightened level of efficacy and sensitivity can create
misleading results or reveal the profile of an individual who was not present at the crime
scene or involved in the crime. As science changes and technology consistently
improves, both the legal and scientific communities will need to work together in
the interest of justice to fully appreciate the magnitude of such findings when DNA is
detected from multiple sources.


Introduction


Since its first introduction in 1985, DNA typing has evolved rapidly and has become the 
most powerful technique in criminal investigations. DNA profiles obtained from evidence
can be compared to the DNA profiles of a suspect, a victim, or other samples, or
they can be searched in national DNA databases. This allows for linking a suspect with 
the victim, establishing the presence of an individual at a scene, and also connecting 
different crimes (An et al., 2012). Moreover, DNA has been proven useful in parentage 
testing, establishing immigration eligibility, and genealogical studies. 

Routine forensic DNA tests consist of the analysis of length polymorphisms (short
tandem repeats, or STRs) found on either the autosomes or the sex chromosomes (X and Y).
In addition, mitochondrial DNA (mtDNA) analysis is carried out to study the maternal
lineage of an individual in sample-limiting conditions.

STR alleles differ in the total length of the repeated region, the length
of the repeat units, and the number of units. Routinely, STR markers are amplified by
polymerase chain reaction (PCR) using sequence-specific primers, and then PCR prod- 
ucts are separated and detected by capillary electrophoresis (CE). 
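
Because CE reports fragment length, an allele designation can be derived by subtracting the
non-repeat flanking portion of the amplicon and dividing by the repeat unit length. The
following Python sketch illustrates this arithmetic only; the 120 bp flank is a hypothetical
value, and real genotyping software assigns alleles by comparison with an allelic ladder
rather than by this calculation alone.

```python
def allele_from_length(amplicon_bp, flank_bp, unit_bp=4):
    """Convert an amplicon length into an allele name such as '9' or '9.3'."""
    repeat_bp = amplicon_bp - flank_bp
    if repeat_bp < 0:
        raise ValueError("amplicon is shorter than its flanking sequence")
    whole_units, extra_bases = divmod(repeat_bp, unit_bp)
    # Incomplete repeats are written as whole units, a dot, and the extra bases.
    return str(whole_units) if extra_bases == 0 else f"{whole_units}.{extra_bases}"


FLANK_BP = 120  # hypothetical combined length of primers and flanking sequence
print(allele_from_length(156, FLANK_BP))  # 9
print(allele_from_length(159, FLANK_BP))  # 9.3
```

The second call shows a microvariant: nine complete 4 bp units plus three additional bases.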

This method unfortunately has some limitations, such as the inability to analyze mul- 
tiple genetic polymorphisms in a single reaction using a single workflow, partial loss of 
information from degraded or low-quality DNA samples, and difficulties in defining 
components in a mixture (Yang et al., 2014). 

CE is also the basis of the Sanger method, which represented the gold standard for 
DNA sequencing for approximately 40 years. For it, Frederick Sanger was awarded 
the Nobel Prize in chemistry in 1980. The Sanger method, also known as first- 
generation sequencing, is based on the selective incorporation by DNA polymerase of 
chain-terminating dideoxynucleotides (ddNTPs) during PCR (Sanger & Coulson, 1975;
Sanger et al., 1977).

Initially, radioactively labeled ddNTPs were used for chain termination; these were
subsequently replaced with fluorescently labeled nucleotides with the introduction of
automatic sequencers.
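
The logic of chain termination can be sketched in a few lines of Python. This is a toy
model, not a description of any instrument's software: it assumes that termination occurs
once at every position, that each fragment is labeled by its terminal ddNTP, and that
separation orders fragments perfectly by length. All function names are illustrative.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def synthesized_strand(template):
    """Strand built by the polymerase, written 5'->3' (reverse complement)."""
    return "".join(COMPLEMENT[base] for base in reversed(template))


def terminated_fragments(template):
    """One (length, terminal ddNTP base) pair per possible termination position."""
    new_strand = synthesized_strand(template)
    return [(i + 1, base) for i, base in enumerate(new_strand)]


def read_by_size(fragments):
    """Order fragments by length, as electrophoresis does, and read the labels."""
    return "".join(base for _, base in sorted(fragments))


template = "ATGC"  # template written 5'->3'
fragments = terminated_fragments(template)
print(read_by_size(reversed(fragments)))  # GCAT
```

Reading the terminal labels from the shortest fragment to the longest recovers the newly
synthesized strand, which is the complement of the template.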

CE-based Sanger sequencing is generally used in forensics to analyze specific regions 
of mtDNA. 

Sanger sequencing produces one template sequence per reaction, and
because of this, it is the best choice for analyzing small numbers of gene targets and
samples.

However, the method has some limitations, such as poor quality in the first 15 to 40 bases
of the sequence due to primer binding, deteriorating quality of sequencing reads after
700 to 900 bases, and limited sequence coverage, which is generally restricted to the
regions of interest.

These limitations induced forensic scientists worldwide to explore new technologies, 
and in the last few years, the introduction of next-generation sequencing (NGS) has 
greatly improved forensic DNA analysis. 


Next-generation sequencing platforms 


In recent decades, innovative technologies have been developed and have found application
in several fields of biological research (Bruijns et al., 2018).

NGS was introduced for commercial use in 2005 (Margulies et al., 2005). It was 
initially called “massively parallel sequencing”, because it enables simultaneous 
sequencing of many DNA strands in lieu of one strand at a time as in Sanger sequencing. 

NGS technology is used to determine DNA or RNA sequences in entire genomes or 
targeted regions. It offers high-throughput capacity, reduced costs, and high speed, 
allowing it to perform a wide variety of applications. Because of this, it has revolutionized 
the biological sciences, and it has been applied in various fields such as ancient DNA anal- 
ysis (Poinar et al., 2006), agrigenomics (Goddard & Hayes, 2009), disease diagnosis 
(McCarth et al., 2013), and forensics (Weber-Lehmann et al., 2014). 

NGS enables performing millions to billions of sequencing reactions simultaneously 
in multiple samples, as well as analyzing different types of genomic traits in a single run. 
Additional advantages of NGS include lower sample input requirements, higher accu- 
racy, and the ability to detect variants at lower allele frequencies than with Sanger 
sequencing. 

Using CE Sanger sequencing, the Human Genome Project took over 10 years with a 
cost of around $3 billion. NGS now allows the sequencing of the whole human genome 
within 3 days at a cost of around US $1000. 

Depending on the length of the sequences produced, second- and third-generation
sequencing methods are available, and they are defined as short-read and real-time 
long-read technologies, respectively (Liu et al., 2012; Slatko et al., 2018). 


Second-generation sequencing


The NGS workflow basically includes three steps: sample preparation, nucleic acid 
sequencing, and data analysis (Kumar et al., 2019). 

There are several different strategies used to produce clonal template populations:
emulsion PCR, solid-phase amplification, and DNA nanoball generation. In all cases, the first
step consists
of the fragmentation of the DNA template. 

In emulsion PCR, DNA templates are fragmented and ligated to adapter sequences, 
and then they are captured inside micelles (aqueous droplets) together with a bead covered
with complementary adapters, primers, dNTPs, and DNA polymerase. The PCR reaction is
carried out within the micelle, and each bead becomes covered with thousands of copies 
of the target DNA sequence. This strategy is used by 454 (Roche), SOLiD (Thermo 
Fisher), GeneReader (Qiagen), and Ion Torrent (Thermo Fisher). 

In solid-phase bridge amplification, after fragmentation, DNA templates are ligated to 
adapters, and then they bind with primers immobilized onto a solid support. The free 
extremity interacts with other primers, creating a bridge structure. Using PCR, replicate
strands are produced from the immobilized primers while unbound DNA is removed.

In solid-phase template walking, fragmented DNA templates are ligated to adapter 
sequences, and then they bind with primers immobilized onto a solid support. Through 
the PCR process, a second strand is produced. The new double-strand is partially dena- 
tured, allowing the free end of the original template to bind with another primer 
sequence. Reverse primers are used to initiate strand displacement to generate additional 
free templates, each of which can bind to a new primer. 

Solid-phase amplification is used by Illumina platforms, which rely on bridge amplification.

In DNA nanoball generation, the DNA template is fragmented and ligated to the first 
of four adapters. The template is amplified, circularized, and cleaved. This process is 
repeated for all adapters. The final product is a circular template including four adapters, 
each separated by a template sequence. In this way, a large mass of concatamers (nanoball 
DNA, or nDNA) is generated. 

This strategy is used by Complete Genomics (BGI). 

Sequencing methods are grouped into two major categories: sequencing by hybridization
(SBH) and sequencing by synthesis (SBS). The SBH method was developed in the
1980s. It relies on the use of DNA oligonucleotides arrayed on filters, which are then
hybridized to labeled fragments of the DNA to be sequenced. The method is based
on the use of specific probes to interrogate specific sequences and allows the analysis
of large sequences by overlapping information. It is generally used for diagnostic
applications such as the identification of single-nucleotide polymorphisms (SNPs) related
to diseases and the identification of chromosome rearrangements and abnormalities
(Slatko et al., 2018; Church, 2006; Drmanac et al., 2002; Hanna et al., 2000; Qin et
al., 2012). The SBS method represents a further development of Sanger sequencing. It
involves repeated cycles of synthesis to incorporate additional nucleotides into the
growing chain. SBS approaches include cyclic reversible termination (CRT) and
single-nucleotide addition (SNA). CRT uses terminator molecules (3′-OH group blocked)
to prevent elongation. In SNA, each of the four nucleotides is added iteratively to
ensure that only one dNTP is responsible for the signal. Thus, the absence of the next
nucleotide in the reaction prevents elongation (Goodwin et al., 2016). Current SBS methods
differ from the original Sanger sequencing because they rely on shorter reads
(50 to 300 nucleotides) with high sequence coverage, producing millions to billions of
reads in parallel (“massively parallel sequencing”). The most widely used next-generation
technologies are based on SBS approaches, such as the 454, Ion Torrent, and Illumina
platforms.
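
The idea of analyzing a larger sequence “by overlapping information” in SBH can be sketched
as follows. The Python below is a toy reconstruction under strong assumptions that are not
stated in the text: every probe that hybridizes is known without error, all probes have the
same length k, and every (k-1)-base overlap is unique. The function name and the probe set
are invented for illustration.

```python
def reconstruct_from_probes(kmers):
    """Chain k-mers by their (k-1)-base overlaps.

    Assumes an error-free probe set in which every overlap is unique;
    repeats in the target make the reconstruction ambiguous.
    """
    kmers = list(kmers)
    by_prefix = {kmer[:-1]: kmer for kmer in kmers}
    suffixes = {kmer[1:] for kmer in kmers}
    starts = [kmer for kmer in kmers if kmer[:-1] not in suffixes]
    if len(starts) != 1 or len(by_prefix) != len(set(kmers)):
        raise ValueError("ambiguous spectrum: overlaps are not unique")
    sequence = current = starts[0]
    seen = {current}
    while current[1:] in by_prefix:
        current = by_prefix[current[1:]]
        if current in seen:
            raise ValueError("ambiguous spectrum: overlaps form a cycle")
        seen.add(current)
        sequence += current[-1]
    return sequence


# Hypothetical 4-base probes that hybridized, listed in no particular order.
probes = ["GCGT", "ATGG", "TGCA", "GGCG", "CGTG", "TGGC", "GTGC"]
print(reconstruct_from_probes(probes))  # ATGGCGTGCA
```

The function raises an error rather than guessing when overlaps are not unique, which
reflects why repetitive sequence is difficult for hybridization-based approaches.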

The first NGS method developed was Roche 454 pyrosequencing, commercialized in
2007. Pyrosequencing is a real-time sequencing alternative to the Sanger method (Ronaghi
et al., 1996). It is based on the detection of pyrophosphate release and the consequent
generation of light after nucleotide incorporation. As a dNTP is incorporated into a strand,
an enzymatic cascade occurs, producing a bioluminescence signal that is detected by a
charge-coupled device camera. By reading the optical signals, it is possible to identify which
base is incorporated. Pyrosequencing is cheaper and faster than Sanger sequencing
(Andreasson et al., 2002; Goodwin et al., 2016; Margulies et al., 2005). Using the Genome
Sequencer FLX System, the method was able to sequence 400 to 600 megabases in 10 h.

In 2010, Applied Biosystems began the commercialization of the Ion Torrent, a non-optical
sequencing platform built on semiconductor integrated circuits that detect ions.
This method relies on the iterative addition of each of the four dNTPs to a
sequencing reaction to ensure that only one nucleotide is responsible for the signal. The
absence of the next nucleotide stops the sequencing reaction, preventing elongation.
During DNA synthesis, when a new dNTP is incorporated, a proton (H+) is released,
and the resulting small change in pH is detected by a sensor (a field-effect transistor).
This permits the identification of the bases (A, C, G, and T) incorporated into the chain.
The chemical signals are translated into digital information. Different types of chips are
available on the Ion Torrent platforms, enabling complete sequencing in less than 2 h.
Currently, two Ion Torrent instruments are available: the Ion GeneStudio S5 System
and the Ion Torrent Genexus System.
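
The iterative, one-nucleotide-at-a-time addition described above can be sketched as a small
Python simulation. It is a simplified model: the flow order is an assumption chosen for
illustration, and the signal is taken to be exactly proportional to the number of bases
incorporated in a flow, which real instruments only approximate.

```python
FLOW_ORDER = "TACG"  # assumed cyclic order of nucleotide flows, for illustration


def flowgram(new_strand, flow_order=FLOW_ORDER, max_flows=100):
    """Return (base, signal) per flow; signal = bases incorporated in that flow."""
    signals = []
    position = 0
    for flow in range(max_flows):
        if position >= len(new_strand):
            break
        base = flow_order[flow % len(flow_order)]
        count = 0
        # A run of identical bases is incorporated within a single flow.
        while position < len(new_strand) and new_strand[position] == base:
            count += 1
            position += 1
        signals.append((base, count))
    return signals


def call_bases(signals):
    """Rebuild the sequence from the per-flow signals."""
    return "".join(base * count for base, count in signals)


signals = flowgram("TTAGC")
print(signals)
# [('T', 2), ('A', 1), ('C', 0), ('G', 1), ('T', 0), ('A', 0), ('C', 1)]
print(call_bases(signals))  # TTAGC
```

Flows with a signal of 0 correspond to nucleotides that did not match the next template
position, and the signal of 2 in the first flow shows how a homopolymer is read in one step.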

Illumina is the most widely used sequencing technology because it offers effective
platforms capable of delivering highly accurate results, from small, low-throughput benchtop
instruments to large, ultrahigh-throughput instruments for whole-genome sequencing.
The system uses terminator molecules that are similar to those used in Sanger sequencing.
dNTPs are labeled with a base-specific, cleavable fluorophore and blocked by a
3′-O-azidomethyl group. During each cycle, all four dNTPs are added, and when a single dNTP
is incorporated, it reversibly blocks chain elongation. The emitted color, imaged by total
internal reflection fluorescence (TIRF) microscopy using two or four laser channels,
identifies which nucleotide was incorporated in each cluster. The dye is then cleaved,
and the 3′-OH is regenerated with the reducing agent tris(2-carboxyethyl)phosphine (TCEP).
After the removal of the fluorophore and blocking group, a new cycle can begin.
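
The cycle described above, in which every cluster is extended by exactly one blocked,
labeled nucleotide and then imaged, can be sketched in Python as follows. The dye colors,
cluster names, and sequences are hypothetical, and the model ignores signal decay and
phasing errors that affect real runs.

```python
# Assumed base-to-color assignment, for illustration only.
DYE = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
BASE = {color: base for base, color in DYE.items()}


def image_cycles(clusters, cycles):
    """Return one image per cycle: the color seen at each cluster."""
    images = []
    for cycle in range(cycles):
        # One 3'-blocked nucleotide is added per cluster, so each cycle
        # reveals exactly one position, even inside a homopolymer.
        image = {
            name: DYE[strand[cycle]]
            for name, strand in clusters.items()
            if cycle < len(strand)
        }
        images.append(image)
    return images


def call_reads(images):
    """Translate the per-cycle colors back into one read per cluster."""
    reads = {}
    for image in images:
        for name, color in image.items():
            reads[name] = reads.get(name, "") + BASE[color]
    return reads


clusters = {"cluster_1": "AAGT", "cluster_2": "CGTA"}
print(call_reads(image_cycles(clusters, cycles=3)))
# {'cluster_1': 'AAG', 'cluster_2': 'CGT'}
```

Read length equals the number of cycles, and all clusters are read in parallel within each
image, which is where the throughput of this approach comes from.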

The first Illumina platform, the Genome Analyzer, was launched in 2007.
Initially, it sequenced 6 million amplified fragments per lane and generated reads of only
around 30 bases, but it was later improved to an 80-gigabase output.

HiSeq, the second instrument released by Illumina, followed in 2010, and then came the
HiSeq X10, which further increased the number of fragments that could be
analyzed. Today, the NextSeq and NovaSeq machines are the most widely used, along
with some benchtop sequencers (iSeq 100, MiniSeq).

In 2012, Qiagen bought the Intelligent BioSystems CRT platform, which was 
commercialized and relaunched in 2015 as the GeneReader 22, which is an all-in-one 
platform able to perform all steps from sample preparation to analysis. Because of this, 
the GeneReader system is connected with the QIAcube sample preparation system 
and the Qiagen Clinical Insight platform for variant analysis. 

The GeneReader uses the same approach used by Illumina: dNTPs are blocked by a
3′-O-allyl group, and some of the bases are labeled with a base-specific, cleavable
fluorophore. Unincorporated bases are washed away, and the sample is imaged by TIRF using
four laser channels. The dye is then removed, and the 3′-OH is regenerated by adding a
mixture of palladium and P(C6H4SO3Na)3 (TPPTS).

The GenapSys Sequencing Platform is a small, fast, and cheap desktop instrument. It is
an alternative NGS system useful for the analysis of small panels due to its affordability 
and ease of use. The system uses complementary metal oxide semiconductor sequencing 
chips with millions of sensors (electrodes) on their surface that detect differential electrical 
changes caused by nucleotide incorporation. It is used for many applications, such as tar- 
geted sequencing, small genome sequencing, gene editing validation, RNA sequencing, 
and targeted single-cell assay sequencing. 

Some problems related to second-generation sequencing, such as the difficulty of reli- 
ably resolving large genomic rearrangements or the complexity in data assembly, were 
overcome by the development of third-generation sequencing. 


Third-generation sequencing 


Third-generation NGS is a group of DNA sequencing methods that were first described 
in 2009 and are still under active development. The aim of third-generation sequencing is 
to sequence longer DNA (and RNA) sequences (>1 kb to 2 Mb). The approach is based 
on single molecule reading in real-time, which allows the generation of long reads faster 
than with previous sequencing techniques since clonal amplification steps are eliminated 
(Ozsolak, 2012). 

These methods can be grouped into three categories: the first category is where single 
polymerase molecules are monitored during DNA synthesis (e.g., PacBio); the second 
category is based on the transport of DNA molecules through a nanopore membrane 
(e.g., Oxford Nanopore); and the third category consists of methods that use direct imaging
of individual DNA molecules by advanced microscopy (e.g., VisiGen) (Schadt
et al., 2010; Stoloff & Wanunu, 2013; Thompson & Milos, 2011).

PacBio sequencing, also known as single-molecule real-time (SMRT) sequencing, 
enables the analysis of very long fragments (up to 30-50 kb). The SMRT method is
based on the binding of the DNA to be sequenced to a DNA polymerase at the bottom
of a well in an SMRT flow cell, and the DNA strand to be sequenced progresses through
it. Imaging occurs only at the bottom of the well when fluorescently labeled dNTPs are 
incorporated by DNA polymerase. The four nucleotides are labeled with different fluo- 
rophores so that each base is identified as it is incorporated into the growing DNA chain. 
Imaging is related to the rate of nucleotide incorporation. Incorporation occurs
in parallel in up to 1 million wells on a single chip within the SMRT cells.
This technique is fast, and longer reads can be generated than with previous generations 
of sequencing techniques (Van Dijk et al., 2014). 

In 2014, Oxford Nanopore Technologies released the MinION platform based on 
the transport of DNA through a nanopore membrane embedded in a lipid bilayer or a 
synthetic polymer (Mikheyev & Tin, 2014). The detection principle of nanopore systems
is based on the change in ion current that occurs when a charged molecule passes
through the pore. When an electric field is applied, DNA molecules pass through
the pore, and the current of other ions is partly blocked. The decrease in current
amplitude is used to determine the nucleotide sequence of the DNA passing through the
membrane since each of the four different nucleotides produces a distinct current level
(Bruijns et al., 2018; Haque et al., 2013; Schneider & Dekker, 2012). With a nanopore
system, a single DNA molecule can be read rapidly without the need for PCR amplifi- 
cation or the use of expensive fluorescent dye labels. 

The VisiGen Biotechnology approach is based on fluorescence resonance energy 
transfer (FRET) and the use of a special DNA polymerase, which acts as a ‘real-time 
sensor’ for modified nucleotides. When fluorescently labeled dNTPs are incorporated
into a single DNA strand, FRET occurs. The frequency of the FRET signal varies
depending on the label carried by each nucleotide, so it is possible to determine
the base sequence. This approach, which requires neither cloning nor amplification,
reduces the analysis cost and can produce long reads (around 1 kb) in a short
time (Pareek et al., 2011) (Table 3.1, Table 3.2).


NGS forensic applications 


As technology advances, the NGS method shows huge promise to become a routine 
method in forensic applications. NGS allows examiners to generate data that spans the 
human genome in a single, targeted assay. Such a test may be useful in many aspects, 
as described below. 


Overview of NGS platforms and technological advancements for forensic applications 


Table 3.1 Main advantages and disadvantages of second- and third-generation NGS.

Generation | Advantages | Disadvantages
Second-generation NGS | High accuracy; ability to sequence fragmented DNA; low cost | Short sequencing reads (between 200 and 300 bases); difficulty in resolving structural variants or distinguishing highly homologous genomic regions; not suitable for sequencing regions containing large numbers of repetitive sequences
Third-generation NGS | Very long sequence reads; easy library preparation; good identification of epigenetic markers; portable technologies | Weak signals from individual fragments; overall lower accuracy


Criminal casework: Using NGS, forensic laboratories can analyze multiple sets of
forensically relevant markers (STRs and/or SNPs) in a single assay. This makes it possible
to obtain a wide range of information from degraded, scarce, or mixed DNA samples from
cold cases.

Forensic DNA databases: Each year millions of DNA profiles are stored in national 
DNA databases. NGS can help forensic laboratories to produce high-quality forensic 
DNA profiles faster than with the CE method. 

Disaster victim identification (DVI): In the case of mass disasters, it is important to
obtain as much information as possible from highly compromised samples in a short
time. NGS greatly helps in this regard because of its ability to analyze hundreds of
markers simultaneously in a single test.

Missing person identification: In missing person identification cases, samples are often
compromised. NGS allows sequencing of both nuclear DNA and mtDNA on a single platform,
overcoming difficulties encountered with human remains samples.


Short tandem repeats and single nucleotide polymorphism analysis 


STR analysis is essential in criminal casework because of the large national DNA data- 
bases, which include STR profiles from criminal offenders, evidence, missing persons, 
and samples from old cases. Consequently, forensic NGS assays must be able to sequence 
STR loci (Borsting et al., 2014). STR analysis represents the most commonly used 
method in forensics due to its multiple advantages, such as multiplex amplification,
precise allele discrimination, and low DNA template requirements. However, simultaneous
detection of a large number of STR markers is difficult due to the technical limitations of
fluorescence-based CE technology (Yang et al., 2014). NGS technology shows many
potential advantages, such as high throughput, the capability for simultaneous detection of a
wide range of autosomal and sex-chromosome STR loci, and the ability to distinguish alleles
with similar lengths.

Table 3.2 Summary of common NGS platforms.

Platform | Technology | Analysis time (average) | Read length (average) | Throughput Mb/h (average)
454/Roche FLX (2008) | Pyrosequencing | | 200-300 bp | 20-30
Illumina/Solexa (2008) | Reversible dye terminator | | 30-100 bp | 20
ABI/SOLiD (2008) | Oligonucleotide 8-mer chained ligation | | 20-45 bp | 5-15
Illumina MiSeq (2019) | Reversible dye terminator | | 2 x 75 to 2 x 300 bp | 170-250
Oxford Nanopore MinION (2014) | | | 13-20 kb | 700
Illumina NovaSeq (2019) | Reversible dye terminator | | 2 x 50 to 2 x 150 bp | 22,000-67,000
Ion Torrent (2019) | Native dNTPs, ion detection | | 200-600 bp | 110-920
BGI MGISEQ-T7 (2019) | Nanoball technology | | 2 x 150 bp | 250,000
Pacific Biosciences SMRT (2019) | Phospholinked fluorescent nucleotides | | 10-30 kb | 1300

Routinely, Y-STRs are used to identify the male component of DNA mixtures,
especially in sexual assault cases where a high female background is present, or to
reconstruct relationships through the paternal lineage. The ability to analyze autosomal
and sex-chromosome STRs simultaneously greatly increases analysis efficiency and reduces
cost while allowing better identification, especially in the case of mixed samples.
Furthermore, it has been shown that Y chromosome sequencing could help in
distinguishing between mixed male samples from the same male lineage (Yang et al., 2014):
in an NGS study of two men who shared an ancestor 13 generations earlier,
around 10 million nucleotides of the Y chromosome were sequenced, and four genetic
differences were found (Xue et al., 2009).

In order to standardize the nomenclature used for describing complex STR sequences
obtained by NGS and to enable comparison with the routinely used nomenclature
derived from CE systems and with data stored in the national DNA databases, the
DNA Commission of the International Society for Forensic Genetics has developed
recommendations (Parson et al., 2016).

SNPs are useful in forensics due to their abundance in the human genome, their low
mutation rate, and their smaller size compared with STRs, which increases the chance of
success when analyzing degraded samples. Because of this, they are useful in human
identification, ancestry studies, and phenotype prediction (Sobrino & Carracedo, 2005).
Furthermore, the high throughput of NGS can assist in targeting SNP sets not routinely
available with traditional CE-based methods. In fact, NGS, in comparison with traditional
methods (SNaPshot or array systems), allows us to analyze a larger number of markers
simultaneously, even from low DNA quantities (Seo et al., 2013). In addition, it is possible
to target microhaplotypes, which have shown promise in forensics for identification and
ancestry purposes (Pakstis et al., 2012). NGS technologies have led to the discovery of
millions of SNPs, and several large custom SNP panels have been developed
(Wang et al., 2019; Zhang et al., 2017).


Forensic phenotyping and ancestry studies 


Forensic phenotyping includes the prediction of externally visible characteristics (e.g., eye
color, hair color, and skin color), biogeographic ancestry (that is, the geographical
origin of an individual), and the person’s age. Tests for the identification of these
characteristics have been developed and validated, along with adequate predictive
statistical models. This allows mass screening when there are no suspects or matches with
profiles in DNA databases; it is also useful in cases of identification of human remains
(DVI cases) (Shneider Peter et al., 2019).
In 2011, Walsh from Erasmus University developed IrisPlex, the first assay for
accurate prediction of blue and brown eye color, followed in 2013 by the HIrisPlex system
for simultaneous prediction of hair and eye color with an accuracy of 90% (Walsh et al.,
2011, 2013). In 2018, Walsh developed the HIrisPlex-S system for the analysis of 41
SNPs for the joint prediction of eye, hair, and skin color. Blue and brown
eyes are predicted more accurately than eyes that are neither brown nor blue; red and
black hair more accurately than blonde and brown; and dark skin colors more accurately
than light (Chaitanya et al., 2018).

The prediction of facial shape is one of the main challenges in phenotyping on the way to
a final “DNA facial composite”. Some genetic markers associated with facial features are
correlated with craniofacial development and, consequently, linked to normal variation in
facial shape; others have been identified in studies of facial syndromes, deformities, and
disease (such as cleft palate, cleft lip, and other craniofacial dysplasias). However, the
human face is a complex feature composed of different characteristics (for example, eye
shape, distance between the eyes, nose shape, and mouth shape), whose development involves
molecular and environmental interactions that are not fully understood.

In 2009, Klimentidis and Shriver (Klimentidis & Shriver, 2009) started to study facial
features by DNA testing and association analysis and validated their results using facial
reconstruction (molecular photofitting). By combining information about genotype, sex,
ancestry, and the effects of particular alleles on facial features, information about facial
shape can be inferred (Claes et al., 2014). In 2020, Liu and colleagues found 12 SNPs from
10 genes associated with one or more facial shape traits and were able to provide reference
data for DNA-based face prediction in the Han Chinese population (Liu et al., 2020). Since
forensic DNA phenotyping (FDP) goes beyond standard forensic DNA typing, its legalization
is under debate in different countries due to ethical issues (Shneider Peter et al., 2019).

In 2017, the VISible Attributes through GEnomics (VISAGE) Consortium was 
created as a collaboration between 13 partners from academic, police, and justice institu- 
tions belonging to 8 European countries. VISAGE received funds from the European 
Union’s Horizon 2020 program with the aim of overcoming the general limitations of current
forensic DNA analysis by developing new tools that complement conventional DNA
analysis, such as predicting the appearance and age of an evidence donor
(http://www.visage-h2020.eu/), in order to obtain from DNA a sketch of
unknown perpetrators.

An ancestry study is based on the genetic characteristics that a person has inherited
from their biological ancestors. The greater the distance between the geographical regions
of origin of two people, the greater the genetic differences between them. These
differences are due to mutations, migrations, local selection, isolation, and heredity.
Because of this, some DNA markers are very common in particular geographic regions and
rare in others. Markers located on the autosomes are inherited from both parents and,
therefore, reflect the geographical regions of origin of both. Markers located on the Y
chromosome are transmitted only from father to son and therefore exclusively reflect the
geographical origin of the ancestors through the male lineage. mtDNA markers are
transmitted only from mother to child and therefore exclusively reflect the origin of the
ancestors through the female (maternal) lineage. Because of this, ancestry
analysis involves the study of all these ancestry informative markers (AIMs) and a
prediction based on the maximum likelihood obtained using a classification algorithm (Elaine
et al., 2018; Phillips et al., 2007, 2014). For example, in the investigation of the 2004
Madrid terrorist attack, the population of origin of a suspect was inferred using 34
autosomal SNPs related to ancestry (Phillips et al., 2009).
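
The maximum-likelihood classification mentioned above can be illustrated with a minimal
sketch. It is not the algorithm of any of the cited studies: the population names, marker
names, and allele frequencies are hypothetical, and the sketch assumes Hardy-Weinberg
proportions and independent markers.

```python
import math

# Hypothetical reference-allele frequencies at three AIMs in two
# hypothetical populations (illustrative values, not real data).
ALLELE_FREQ = {
    "PopA": {"snp1": 0.90, "snp2": 0.20, "snp3": 0.70},
    "PopB": {"snp1": 0.10, "snp2": 0.80, "snp3": 0.40},
}


def genotype_prob(p, n_ref):
    """Hardy-Weinberg probability of carrying n_ref (0, 1 or 2) reference alleles."""
    q = 1.0 - p
    return {0: q * q, 1: 2.0 * p * q, 2: p * p}[n_ref]


def log_likelihoods(genotype):
    """Log-likelihood of a genotype profile under each population.

    Markers are treated as independent, so per-marker log-probabilities add.
    """
    return {
        pop: sum(math.log(genotype_prob(freqs[snp], n)) for snp, n in genotype.items())
        for pop, freqs in ALLELE_FREQ.items()
    }


def most_likely_population(genotype):
    """Return the population under which the profile has maximum likelihood."""
    scores = log_likelihoods(genotype)
    return max(scores, key=scores.get)


# Profile: number of reference alleles observed at each marker.
profile = {"snp1": 2, "snp2": 0, "snp3": 1}
print(most_likely_population(profile))
```

In practice, many more markers and reference populations are used, and the likelihoods are
usually reported together with a measure of confidence rather than as a single best call.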


Epigenetic analysis 

Epigenetics is the study of external, non-heritable factors that may temporarily
modify genes and affect their expression, e.g., chemical modifications due to exposure to
the environment or as a consequence of diet or lifestyle. Epigenetic markers have
several applications in forensic science, for example, age prediction, tissue
identification, and monozygotic (MZ) twin differentiation. Aging is strongly correlated with
changes in DNA methylation (Bocklandt et al., 2011; Frumkin et al., 2011). The
molecular estimation of age is based on the fact that the activity of some genes changes
during aging. One type of age-dependent gene regulation is the increasing or decreasing
degree of cytosine methylation in CpG dinucleotides in the promoter regions of certain genes
(Johnson et al., 2012).
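
The relationship between methylation and age can be sketched as a simple regression. The
example below is only an illustration under strong assumptions: it uses a single
hypothetical CpG site and invented training values, whereas published age-prediction
models combine many CpG sites and are trained on large reference datasets.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(predicted, observed):
    """Average absolute difference between predicted and observed ages."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


# Hypothetical training data: methylation fraction at one CpG site and the
# chronological age (years) of each donor. Values are invented.
methylation = [0.20, 0.30, 0.40, 0.50, 0.60]
age = [22, 28, 41, 52, 57]

slope, intercept = fit_line(methylation, age)
predicted = [slope * m + intercept for m in methylation]
mae = mean_absolute_error(predicted, age)

# Predicted age for a new sample with 45% methylation at the same site.
new_sample_age = slope * 0.45 + intercept
print(round(new_sample_age, 1), round(mae, 2))
```

The mean absolute error computed here plays the same role as the average prediction error
reported for published age-estimation models, although it should be measured on samples
that were not used for training.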

In forensic investigations, the ability to predict the biological age of the donor of a
sample can provide relevant information in cases where no suspects or matches in the
DNA database are found (Freire-Aradas et al., 2017). From a clinical point of view, it is
useful for estimating the life expectancy of an individual (Horvath, 2013; Horvath & Raj,
2018). DNA methylation analysis available to date allows for estimating age with an
average error of ±4-5 years (Freire-Aradas et al., 2020; Alvarez-Dios et al., 2020).

Epigenetic markers have also been found useful for distinguishing the type of
biological material found at crime scenes. In some cases, traditional serological methods
used for identifying body fluids are not sufficient due to low sensitivity (especially in the
case of trace samples) and the inability to distinguish, for example, menstrual blood from
peripheral blood (Titia Sijen & Harbison, 2021). An alternative approach is to use
patterns of DNA methylation (Lee, 2015; Park, 2014), since genome-wide
methylation analysis allows the identification of differences in the degree of methylation
of CpG sites that are useful for distinguishing different tissues (Forat et al., 2016; Park
et al., 2014).

Epigenetic approaches based on NGS technology include whole-genome bisulfite
sequencing, methylation bead chips, and immunoprecipitation sequencing (Xu et al.,
2012). Because the ability to analyze trace samples is crucial, it is relevant that low
DNA amounts (100 pg) have been successfully analyzed through genome-wide amplification
of a bisulfite-modified DNA template, followed by quantitative methylation detection
using pyrosequencing (Paliwal et al., 2010; Xu et al., 2012).

The ability to differentiate MZ twins is another relevant topic in forensics, useful for
the correct identification of the perpetrator of a crime or a victim. Since twins have the
same DNA, the analysis of autosomal and sex-chromosome STRs, SNPs, or mtDNA cannot
differentiate them. Methylation studies have been found useful to solve this issue: in 65%
of pairs, the twins show almost identical epigenetic profiles, but in the remaining 35%
there are significant differences. In 2010, the Beijing Genomics Institute and TwinsUK, the
twin research group based at King’s College London, started a research project to study by
NGS the methylation patterns of 20 million sites in 5000 pairs of twins,
looking for differences in epigenetic signals that could explain why many twins do not
develop the same features or diseases (Yang et al., 2011). In 2013, 92 significantly
methylated CpG sites were found, representing potential targets for distinguishing
between MZ twins (Li et al., 2013). Similarly, in 2014, extremely rare mutations were
analyzed by ultradeep NGS to differentiate between MZ twins (Weber-Lehmann
et al., 2014).
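
The bisulfite-based methods mentioned above rely on the fact that bisulfite treatment
converts unmethylated cytosine to uracil (read as thymine after amplification), while
methylated cytosine is left unchanged. The following sketch simulates this on a single
strand; the sequence and the methylated position are hypothetical, and real pipelines
must additionally handle both strands, incomplete conversion, and sequencing errors.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate the read obtained from a bisulfite-treated strand.

    Unmethylated C is read as T; methylated C is still read as C.
    """
    methylated = set(methylated_positions)
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


def methylation_calls(reference, converted):
    """For every C in the reference, report True if it was read as C (methylated)."""
    return {
        i: converted[i] == "C"
        for i, base in enumerate(reference)
        if base == "C"
    }


reference = "ACGTCCGA"  # hypothetical sequence; the C at index 1 is methylated
converted = bisulfite_convert(reference, methylated_positions=[1])
print(converted)
print(methylation_calls(reference, converted))
```

Comparing the converted read with the untreated reference is what allows the methylation
state of each cytosine to be read out as an ordinary sequence difference.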


RNA analysis 


DNA analysis allows the identification of an individual but does not give contextual
information about the sample. In the last few years, several studies have been carried out
to understand whether gene expression, and therefore RNA, can be useful for solving this
issue. RNA types such as mRNAs and microRNAs have been shown to be tissue-specific, and
they can provide high specificity for species identification and body fluid identification.
Markers have been identified for the forensically most relevant body fluids (i.e., blood,
saliva, semen, vaginal secretion, and menstrual blood). Additionally, it has been found
that RNA degradation can give important information about sample deposition time (Ingold
et al., 2018; Simard et al., 2012; Zubakov et al., 2008) and the determination of
postmortem interval (Courts & Madea, 2010). Although RNA is highly susceptible to
degradation by environmental RNases, the use of NGS technology can greatly improve the
analysis because millions of RNA sequences can be rapidly analyzed (Hanson et al., 2018).


Microbial forensics 


For over 100 years, microbiology played a minor role in forensic science (MacCallum &
Hastings, 1899), and only in the early 2000s, with the rise of bioterrorism, was so-called
“microbial forensics” born.

Microbial forensics is a discipline developed by the Federal Bureau of Investigation 
after the anthrax attack in the USA on September 18, 2001, due to the serious
consequences produced by microbiological terrorist attacks. It consists of the fast and accurate
detection and identification of microorganisms with the aim of tracing the source of the 
microbes (Beecher, 2006; McEwen et al., 2006). 
Microorganisms are highly diverse and ubiquitous. They are abundant on and inside the
human body, in the environment, and on objects (Oliveira & Amorim, 2018; Sender et al.,
2016; Vazquez-Baeza et al., 2018). Because of this, microbial profiles could be used to
complement traditional criminal investigations (Jake et al., 2021) for establishing sample
provenance, helping identify individuals, determining the cause of death, and estimating
the postmortem interval (PMI) (Jake et al., 2021; Phan et al., 2020; Schmedes et al., 2017).
Since NGS has the advantages of high throughput, multiplexing capability, and accuracy, it
is suitable for rapid whole-genome typing of microorganisms in forensic or epidemiological
investigations, allowing the identification of different bacterial taxa and strains (Brenig
et al., 2010; Fierer et al., 2010; Giampaoli et al., 2014; Lilje et al., 2013; Wooley et al.,
2010).


Mitochondrial DNA sequencing 


For years, sequencing of mtDNA has been used in forensic casework, especially for the
analysis of scarce or degraded samples (e.g., teeth, ancient bones, and hairs without roots)
when STR analysis does not provide a full STR profile (Borsting & Morling, 2015;
Parson, 2004, 2007). mtDNA has a higher copy number per cell than nuclear DNA,
so it can survive in environments where nuclear DNA cannot. Because of this, it can be
a powerful tool for human identification. NGS provides deep coverage, making it possible
to obtain genetic data from small forensic mtDNA samples under non-ideal conditions. Other
advantages include high sensitivity, improved heteroplasmy detection, and a practicable
workflow (Eduardoff et al., 2017). Different PCR strategies exist for amplifying the
mitochondrial genome: two long segments for whole-genome amplification of good-quality DNA,
multiple small fragments (~2 kb) for whole-genome amplification of compromised
samples, and many overlapping small amplicons for amplification of either the control region
or the entire genome of degraded samples (Lyons et al., 2013; McElhoe et al., 2014; Parson
et al., 2015). Multiplexes are available commercially for amplifying either the control
region or the entire mitochondrial genome in 100-400 bp amplicons (Thermo Fisher
Scientific, Promega).
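
The link between deep coverage and heteroplasmy detection can be sketched as follows.
With many reads covering one mtDNA position, a second base present in a minority of
molecules becomes measurable. The read counts and both thresholds below are hypothetical
choices for illustration, not validated reporting thresholds.

```python
def minor_allele_fraction(base_counts):
    """Fraction of reads carrying the second most frequent base at a position."""
    total = sum(base_counts.values())
    if total == 0:
        raise ValueError("no reads cover this position")
    ordered = sorted(base_counts.values(), reverse=True)
    return ordered[1] / total if len(ordered) > 1 else 0.0


def is_heteroplasmic(base_counts, min_fraction=0.05, min_depth=100):
    """Flag a position as heteroplasmic if depth and minor fraction are sufficient.

    min_fraction and min_depth are hypothetical thresholds for this sketch.
    """
    depth = sum(base_counts.values())
    return depth >= min_depth and minor_allele_fraction(base_counts) >= min_fraction


# Hypothetical read counts per base at three mtDNA positions.
deep_mixed = {"A": 900, "G": 100, "C": 0, "T": 0}
deep_clean = {"A": 995, "G": 5, "C": 0, "T": 0}
shallow = {"A": 9, "G": 1, "C": 0, "T": 0}

for counts in (deep_mixed, deep_clean, shallow):
    print(minor_allele_fraction(counts), is_heteroplasmic(counts))
```

The third position has the same minor fraction as the first but too few reads to support
a call, which is why coverage depth matters for low-level heteroplasmy.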

The ability to analyze the entire mitochondrial genome significantly increases the 
discriminatory power in comparison with control region analysis alone, allowing for 
maximal resolution of matrilineal geographic ancestry. Additionally, for severely 
degraded samples, probe capture-NGS methods have been shown to provide results 
with old or extensively damaged mtDNA (King et al., 2014; Peck et al., 2016; Temple- 
ton et al., 2013). 


Animal and plant DNA analysis 

Species determination represents a relevant tool in criminal investigations, for example, in
illegal animal trafficking, in the food industry, or in mass disasters to distinguish human
remains from animal remains (Yang et al., 2014). Routinely, species identification is
performed using species-specific PCR primers. However, in many cases, no a priori
species information is available. In these cases, NGS technology allows identification by
sequencing the entire genome and comparing it with species’ reference genomes. In
addition, NGS is useful in evolutionary studies to determine the genome of extinct
organisms (e.g., Mammut), in veterinary medicine for the identification of genes responsible
for the development of diseases or for disease resistance, and in studies of animal breeding
(Dunisławska et al., 2017).
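
The idea of comparing a sequence of unknown origin with reference genomes can be
illustrated with a toy k-mer comparison. This is only a sketch: the species names and
sequences are hypothetical and far shorter than real genomes, and actual workflows rely
on dedicated alignment or classification tools rather than this simple score.

```python
def kmers(seq, k=4):
    """Set of all overlapping substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(query, reference, k=4):
    """Fraction of the query's k-mers that also occur in the reference."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        return 0.0
    return len(query_kmers & kmers(reference, k)) / len(query_kmers)


def best_match(query, references, k=4):
    """Return the best-scoring reference name and all scores."""
    scores = {name: containment(query, ref, k) for name, ref in references.items()}
    return max(scores, key=scores.get), scores


# Hypothetical miniature "reference genomes" and one sequencing read.
REFERENCES = {
    "species_A": "ACGTACGTGGCCTTAA",
    "species_B": "TTGACCATGCAGTCAG",
}
read = "ACGTACGTGGCC"

name, scores = best_match(read, REFERENCES)
print(name, scores)
```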

In 2012, the 1000 Bull Genomes Project was created with the aim of comparing 
whole-genome sequences of cattle by NGS. It includes 40 international partners, 2700 
dairy and beef animals, and has identified close to 90 million genetic variants (Hayes & 
Daetwyler, 2019). In 2018, the US Department of Agriculture-National Animal Health 
Laboratory Network started a project aimed at evaluating new technologies and devel- 
oping standardized methods for improving studies about animal diseases (Harris et al., 
2021). In addition, NGS may have a great impact on agrigenomics because it allows
rapid and cost-effective plant whole-genome, exome, or transcriptome sequencing
(Fullwood et al., 2009), making it possible to discover genetic markers
involved in plant development, tolerance to stress, or medicinal plant breeding, as well as
domestication genes in crop plants and their wild relatives (Henry, 2012; Sing et al.,
2015). Using NGS, the genome sequences of several plant species
have been completed, and some plants and animals used in traditional Chinese medicines
have also been studied (Yang et al., 2014). Due to their great abundance and variability,
SNPs are currently the marker of choice in plant genetic research and breeding. The first
plant sequenced was Arabidopsis thaliana, followed by rice (Kaul et al., 2000; Yu et al.,
2002) (Table 3.3).


Forensic NGS commercial systems 


In 2015, a review article was published in which NGS technology was called “the future
of forensic DNA analysis” (Borsting & Morling, 2015). The use of NGS in forensic
genetics has been debated, but in the last few years, specific applications for human
identification, determination of phenotypic traits, and mtDNA analysis have been
developed and tested in many forensic and research laboratories (Alvarez-Cubero et al., 2017;
Ballard et al., 2020; King et al., 2014). Initially, the introduction of NGS in forensics
was slow, mainly due to a lack of accredited sequencers and kits and relatively high error
rates in comparison with Sanger sequencing. Currently, a majority of the problematic issues
have been solved (Ballard et al., 2020). NGS technology was initially used mainly
for the study of SNP markers and mtDNA, but nowadays autosomal STR
markers are also analyzed.

Table 3.3 Main NGS applications in forensics (alphabetical order).

Main NGS applications in forensics
Animal and plant DNA analysis
Epigenetic analysis
Forensic phenotyping and ancestry studies
Microbial forensics
Mitochondrial DNA (mtDNA) sequencing
RNA analysis
Short tandem repeat (STR) and single nucleotide polymorphism (SNP) analysis

In 2014, Thermo Fisher Scientific launched two SNP typing assays designed for the
Ion PGM™ System: the HID-Ion AmpliSeq™ Identity Panel for human identification,
which amplifies 124 SNPs, comprising autosomal SNPs (including most of the SNPforID and
individual identification SNPs, IISNPs) and 34 Y-chromosome SNPs; and the HID-Ion
AmpliSeq™ Ancestry Panel (Xavier et al., 2020) for ancestry estimation, which includes
most of the AIMs (Eduardoff et al., 2015; Guo et al., 2016; Nievergelt et al., 2013;
Sanchez et al., 2006). Thermo Fisher Scientific aims to develop assays to be used as a
supplement to PCR-CE typing and to be effective with scarce samples. With the improvement
of sequencing technology, the HID Ion GeneStudio S5 Series sequencers have been
launched, which allow accurate and reproducible results to be obtained from small
quantities of DNA (as little as 100 pg). Currently, the Precision ID GlobalFiler NGS STR
Panel v2 includes the same STR markers available in the CE-based GlobalFiler PCR
Amplification Kit, and it combines high compatibility with database core loci and superior
discrimination power.

The Precision ID Ancestry Panel includes 165 autosomal markers (SNPs) that can
provide biogeographic ancestry information; the Precision ID Identity Panel uses 34 upper
Y-clade SNPs and 90 autosomal SNPs to enable DNA typing from degraded or
challenging forensic samples. The Ion AmpliSeq PhenoTrivium Panel contains DNA
markers for phenotyping, ancestry, and male lineage, for a total of 200 autosomal
SNPs and 120 Y-chromosomal SNPs in a single assay. The Ion AmpliSeq DNA
Phenotyping Panel is used to predict hair and eye color and includes 24 SNPs from the
HIrisPlex system developed by Susan Walsh from Erasmus University (Walsh et al., 2013).

The VISAGE Consortium developed a Basic Tool for Appearance and Ancestry
prediction from DNA that represents the first FDP laboratory tool that combines DNA analysis
of eye, hair, and skin color. It allows the analysis of 153 SNPs in a single multiplex
reaction using the AmpliSeq design pipeline and is applied for massively parallel sequencing
with the Ion S5 platform. It includes markers from the IrisPlex, HIrisPlex, and
HIrisPlex-S systems (Chaitanya et al., 2018; Walsh et al., 2011, 2013).

For biogeographical ancestry (BGA), most AIM-SNPs were taken from the
EUROFORGEN Global AIMs MPS ancestry panel (Eduardoff et al., 2017; Phillips
et al., 2014), supplemented with two additional AIM-SNPs (rs10497191, rs6990312)
from the 55 AIMs panel developed at the Kidd laboratory (Kidd et al., 2014a, 2014b).
Furthermore, 11 additional AIM-SNPs included in the Thermo Fisher Precision ID Ancestry
Panel were added. The panel has high sensitivity and is able to process low-template or
degraded casework samples, obtaining full SNP profiles with DNA amounts down to 100 pg.

In 2017, Illumina and Telegraph Hill Partners established Verogen, Inc., which is the 
only global organization dedicated to the development and supply of NGS-based human 
identification products. The MiSeq FGx Sequencing System represents the first NGS in- 
strument developed and validated for forensic use. It is a platform able to prepare and 
sequence libraries and analyze data in a single workflow designed for a wide range of ap- 
plications, including FDP and forensic genetic genealogy. The MiSeq FGx Forensic Ge- 
nomics System is in use in operational crime laboratories, private service labs, and forensic 
research institutes worldwide. The ForenSeq Kintelligence Kit has been developed for 
forensic purposes, and it enables the analysis of 10,230 forensically relevant markers in 
8 h with less than 2 h of hands-on time. 

The ForenSeq MainstAY Kit simultaneously analyzes 27 autosomal STRs and 25
Y-STR markers with as little as 100 pg of DNA, and it is compatible with all existing STR
databases. The ForenSeq DNA Signature Prep Kit amplifies 27 autosomal STRs, 8 X-STRs, 25
Y-STRs, 95 autosomal human identification SNPs, 56 autosomal BGA-informative AIMs,
and 24 autosomal SNPs associated with phenotypic traits in a single reaction. The
multiplex includes, among others, all of the STR loci in the CODIS and European standard
sets, most of the SNPforID markers, and all of the HIrisPlex loci. With over 200 markers in
a single multiplex, it eliminates the need to run multiple STR tests, reducing time and
costs.

It is the only NGS chemistry approved for uploading STR profiles to the National 
DNA Index System in the USA. The ForenSeq mtDNA Whole Genome Kit and 
mtDNA Control Kit allow successful sequencing of the whole genome and the control
region of mtDNA in forensic samples, even compromised ones, with minimal DNA input.
The forensic laboratory for DNA analysis at the University of Leiden is the first
laboratory in the world accredited to use NGS technology in forensic DNA analysis. A
survey conducted in 2017 among 33 European forensic laboratories showed that at least
52% of them had already bought one or more NGS instruments, the MiSeq/NextSeq
(Illumina) and the Ion Torrent PGM/S5 (Thermo Fisher) being the most widespread.


NGS advantages and challenges 


NGS, using whole-genome or targeted approaches, allows simultaneous analysis of a large
number of markers, such as STRs and SNPs, in parallel with targeted mRNA and small
RNA analysis. This makes NGS a very powerful tool in forensic laboratories (Ballard et
al., 2020). Traditional STR analysis by CE allows us to determine the repeat number of
the STR on the basis of the PCR amplicon size. NGS analysis can determine the full
sequence of the PCR product, including the STR repeat region and the flanking areas,
providing a precise description of the repeat allele structure and all variants in the
surrounding area. Since repeat structures can be complex (due to the presence of
insertions, deletions, substitutions, rearrangements, or SNPs in the flanking regions), this
type of approach results in increased allelic variability for many STRs (Ballard et al., 2020;
Devesse et al., 2017; Gettings et al., 2015), improving the markers’
discrimination power.
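
The difference between length-based and sequence-based allele designation can be made
concrete with a small sketch. The two repeat regions below are hypothetical; the sketch
assumes a tetranucleotide motif and a repeat region that starts in frame, and it ignores
flanking-region variation.

```python
def ce_allele(repeat_region, motif_length=4):
    """Length-based designation: number of complete repeat units."""
    return len(repeat_region) // motif_length


def bracket_notation(repeat_region, motif_length=4):
    """Sequence-based designation: collapse consecutive identical repeat units."""
    units = [
        repeat_region[i:i + motif_length]
        for i in range(0, len(repeat_region), motif_length)
    ]
    runs = []
    for unit in units:
        if runs and runs[-1][0] == unit:
            runs[-1][1] += 1
        else:
            runs.append([unit, 1])
    return " ".join(f"[{unit}]{count}" for unit, count in runs)


# Two hypothetical alleles with the same length but different internal sequence.
allele_a = "TCTA" * 10
allele_b = "TCTA" * 4 + "TCTG" + "TCTA" * 5

for allele in (allele_a, allele_b):
    print(ce_allele(allele), bracket_notation(allele))
```

Both alleles would be sized as allele 10 by CE, whereas their sequences differ, which is
the source of the additional allelic variability described above.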

In the case of DNA mixtures, the improvement in marker discrimination is a great 
benefit for identifying extra sequence-specific alleles that may be masked in the case of 
identical CE amplicon size. In addition, the increased discrimination reduces the risk 
of an adventitious match between the alleles of the alleged donor and the alleles of the 
mixed sample. In this way, it becomes easier to distinguish different DNA components 
in a complex mixture. The utility of STR flanking variants for mixture analysis has been 
reported in studies about deletion/insertion polymorphisms present in the flanking re- 
gions of STRs (DIP-STRs) (Phillips et al., 2007). Furthermore, it has been found that 
by NGS analysis, mixture proportions are accurately reflected by read numbers (Devesse 
et al., 2017). 
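
As a rough illustration of how read numbers can stand in for mixture proportions, the sketch below converts per-contributor read counts into fractions. The counts, the contributor labels, and the assumption that reads have already been assigned to contributor-specific alleles are hypothetical; real mixture interpretation is considerably more involved.

```python
def mixture_proportions(read_counts):
    """Convert reads assigned to each contributor into fractions of the total."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no reads assigned to any contributor")
    return {name: count / total for name, count in read_counts.items()}


# Hypothetical read counts for alleles unique to each contributor at one locus.
counts = {"contributor_1": 1800, "contributor_2": 200}
proportions = mixture_proportions(counts)
for name, fraction in proportions.items():
    print(f"{name}: {fraction:.2f}")
```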

One notable advantage of NGS analysis in comparison with CE is that separation by 
amplicon size is not required for multiplex STR assay design. STRs can be amplified using 
the smallest possible amplicon length, which improves analysis success in cases of compromised 
samples. Furthermore, the number of markers that may be amplified simultaneously is not 
constrained by the number of dyes in a fluorescent-dye detection system, which allows 
coamplification of large panels of markers. In this way, amplification is not limited to 
STR loci, and SNP markers can also be analyzed in combination with STRs. 

The advantage of analyzing SNPs in forensic samples with NGS rather than with 
SNaPshot or SNP array systems is that it is possible to analyze large 
numbers of markers simultaneously from low quantities of DNA (Daniel et al., 2015; 
Grandell et al., 2016; Kidd et al., 2014a, 2014b; Pakstis et al., 2012). It is also possible 
to detect sets of SNPs located in very close proximity to each other on a chromosome 
(microhaplotypes) that have been shown to be useful in forensics for both identification 
and ancestry purposes (Wendt et al., 2017). Traditionally, simultaneous analysis of mul- 
tiple nucleic acids (DNA and RNA) or different DNA markers (STR and SNPs) has not 
been utilized in forensics. 

When the sample quantity is sufficient, the protocol generally includes consecutive 
RNA and DNA extractions or alternatively divides the samples for separate isolations 
(Ballard et al., 2020), and then different analyses are performed. NGS protocols enable 
simultaneous analysis of DNA and RNA, which in the last decade has been shown to 
also have forensic relevance, for example, in the case of tissue and body fluid identifica- 
tion, determination of time of death, etc. 

The main challenge for the application of NGS in forensics is the quantity of 
input sample required for the analysis, because forensic samples are notoriously compromised. 
Initially, large amounts of DNA were required, since a reduced concentration of 
nucleic acids greatly influenced the quality of the library preparation and the choice of 
sequencing approach (Cornelis et al., 2017). Recent improvements allow the successful 
use of low DNA input (Ballard et al., 2020). With the progressive reduction of the 
DNA amount required for preparing NGS libraries, a wide range of forensic 
samples can be analyzed to obtain a large amount of information. Another challenge is 
that, depending on the NGS platform, some factors are variable (e.g., read lengths, system 
flexibility, runtime). Considerable improvements in read length have been achieved in 
the pyrosequencing and SBS platforms. Another relevant variable is the overall 
sequencing capacity, or the total number of clusters (or reads) that are sequenced per run. 

A high degree of flexibility in the experimental design is possible on the SBS and 
semiconductor platforms, which makes them suitable for experiments where the number 
of samples or the sizes of the analyzed regions are variable from one experiment to 
another. In pyrosequencing and the semiconductor platforms, the runtime is 
shorter than in the SBS and sequencing-by-ligation platforms because in the former, 
signal detection is performed in real time, while in the latter, detection is done by imaging. 
Finally, the differences among NGS platforms in terms of data format, length of reads, 
etc. result in the need for diversity in bioinformatics tools to be used for sequence quality 
scoring, alignment, assembly, and data processing (Wold and Myers, 2008; Yang et al., 
2009; Pop and Salzberg, 2008). 

Generally, an NGS run produces tens or hundreds of millions of short reads, which can result 
in terabytes of raw data. Because of the complexity of data analysis, a variety of software 
for NGS data analysis has been developed, with several tools available online. Their func- 
tions consist mainly of alignment of reads to a reference sequence; de novo assembly; 
reference-based assembly; base-calling; and genetic variation detection (SNV, Indel). 
Experience is essential to the successful analysis of data obtained with NGS 
technology. 


Overview of next-generation sequencing 


Genomic research has been revolutionized by next-generation sequencing (NGS), which 
allows complete genomes to be sequenced in a single day at a reduced cost (Behjati et al., 
2013; Van Dijk et al., 2014). This has resulted in significant improvements in disease 
diagnosis, prognosis, and treatment, as well as answers to genetic questions in a variety 
of fields, including forensic science (Fig. 4.1). NGS is now a must-have tool for 
any scientist. From quick single nucleotide polymorphism (SNP) genotyping of a single 
individual to whole genome sequencing (WGS) of huge populations, ultrahigh- 
throughput NGS technologies have a wide range of applications and are fully scalable. 
All NGS platforms perform the sequencing of millions of small fragments of DNA in par- 
allel. Bioinformatics analyses are then used to piece together these fragments by mapping 
the individual reads to the human reference genome. 


NGS versus Sanger sequencing: What’s the difference? 


The concepts behind Sanger sequencing and NGS are similar in principle. In both 
NGS and Sanger sequencing (also known as dideoxy or capillary electrophoresis (CE) 
sequencing), DNA polymerase adds fluorescently labeled nucleotides one by one to a growing 
DNA strand complementary to the template. The fluorescent tag on each incorporated 
nucleotide is used to identify it. 
The volume of sequencing differs significantly between Sanger and NGS. While the 
Sanger method only sequences a single DNA fragment at a time, NGS runs can sequence 
millions of fragments at the same time. This procedure results in the simultaneous 
sequencing of hundreds to thousands of genes. With deep sequencing, NGS also has a 
higher discovery capability for detecting novel or rare variants. 


Next-generation sequencing in forensic science 


Figure 4.1 NGS technology uses. 


NGS technology has the potential to be used in the field of forensic science, as shown in 
Fig. 4.1. DNA database creation, ancestry and phenotypic inference, monozygotic twin 
investigations, bodily fluid and species identification, and forensic animal, plant, and 
microbiological analyses are only a few examples (Weber-Lehmann et al., 2014; Yang et al., 
2014). NGS 
not only facilitates the identification of mixed DNA samples and the analysis of complex 
paternity cases but also has potential advantages for short tandem repeat (STR) testing. 
These include high throughput, low cost, simultaneous detection of large numbers of 
STR loci, the ability to distinguish alleles of the same length by sequence, and digital read counts. 
Forensic experts are therefore actively looking into ways to transition from CE to NGS 
for DNA profile analysis. NGS has various advantages over CE, including the ability to 
combine data from STR and SNP loci, small amplicons without size separation limits, 
higher discrimination power, deep mixture resolution, and sample multiplexing (Van 
Neste et al., 2014). In a single, tailored assay, NGS allows examiners to generate data 
that spans the human genome and answers a larger range of questions. Furthermore, 
the STR calls generated by NGS are perfectly compatible with current database formats. 
NGS can also examine forensic mitochondrial DNA and target the SNP marker sets not 
commonly available with classic CE-based approaches. Studies show that NGS may be 
used to analyze even highly degraded and highly mixed evidential samples with 
encouraging results. 


Overview of the sample preparation process for next-generation 
sequencing 


Several steps must be completed in order to analyze samples using NGS. Fig. 4.2 depicts 
an overview of the steps in the NGS sample preparation process. Sample handling and 
processing are the first steps, followed by DNA extraction and quantitation. After extracting 
the DNA samples and determining their quality and quantity, the library preparation 
process begins. Following library preparation, samples undergo purification and 
normalization before multiplexing and denaturation. 


Figure 4.2 Overview of the steps in the NGS sample preparation process. 


Sample preparation methods 


Genetic material can be extracted from a number of biological samples, including blood, 
cultured cells, biopsies, tissue sections, teeth, bone, and urine, as well as microbes or plant 
specimens, depending on the goal of the investigation (Carrasco et al., 2020). The samples 
used for NGS forensic DNA typing may be ancient, or they may be pristine 
reference samples. Sample handling is crucial during NGS since samples may be degraded 
or in limited quantities. Additionally, due to the high sensitivity of the NGS process, care 
must be taken to prevent contamination. To decrease or eliminate contamination, 
surfaces and nonautoclavable equipment are cleaned with one or more of the following: 
soap and water, UV light irradiation, bleach (10%), ethanol (70%), 
or commercially available solutions, e.g., LookOut DNA Erase Spray (Sigma Aldrich) or 
DNAZap (Invitrogen). 

Many techniques for obtaining DNA from bone and tooth samples 
have been investigated. Because the DNA is preserved in bone cells, marrow, and 
tooth pulp, the samples must first be cleaved, decalcified, and drilled. Crushed samples 
are reduced to tiny bits or powder. To avoid carryover of dust between samples and 
the work area, crushing and pulverizing of bone and teeth are done in a laminar flow 
hood. Between uses, stainless steel equipment, such as freezer mill components and die 
presses used to crush and pulverize bone and teeth, is cleaned and autoclaved. In one 
study, to decrease the risk of contamination while extracting DNA from bones, all activities 
were carried out while wearing face masks, and all reagents and consumables were 
sterilized using short-wave ultraviolet light, bleach, or both before entering the clean 
room (Gaudio et al., 2019). The petrous bone samples were first cleaned by immersing 
them in 5% bleach for 30 s, then rinsed with ethanol and dried. Bone powder was then 
extracted from the cochlea, from the inferior side of the petrous portion, by direct drilling 
with a rotary tool (Dremel 9100 Fortiflex) fitted with a small spherical 1.5 mm grinding 
bit, which had been decontaminated with bleach and ethanol. 

DNase- and RNase-free tubes and aerosol filter tips are used while working with 
chemicals and samples. Samples should be processed in a separate preparation area 
from the amplification area. When sanding and crushing skeletal remains, analysts should 
use N95 protective masks to avoid dust inhalation. When processing materials in the 
DNA lab, other personal protective equipment such as goggles, gloves, hair nets, masks, 
lab coats, and closed-toed shoes should always be used. 


DNA extraction 


For successful NGS, the quality and amount of the extracted DNA are also critical. As 
a result, before moving on to the next phases, attention must be paid to selecting a suit- 
able extraction method and setting quality control parameters. Success on any NGS plat- 
form depends on optimal sample preparation. The rising demand for NGS typically puts 
pressure on upstream processes to analyze more samples and prepare high-quality DNA 
for library prep and analysis. 

There are various ways of extracting DNA from cellular material. While organic 
phenol-chloroform-isoamyl alcohol (PCI or PCIA) (Butler, 2005) and inorganic 
metal-chelating Chelex-100 resin (Walsh et al., 1991) were formerly popular for forensic 
DNA extraction, silica-based approaches currently predominate (Brevnov et al., 2009; 
Castella et al., 2006; Elkins et al., 2021; Eychner et al., 2017; Hoff-Olsen et al., 1999). 
The PCIA method remains the gold standard for most applications because it yields 
the most intact DNA from samples. The caveat is that all phenolic organic compounds 
must be removed to prevent the DNA phosphodiester bonds from being attacked by 
the phenolic hydroxyl group, resulting in fragmented, degraded DNA. Furthermore, 
the PCIA technique is not appropriate for all materials; for example, it digests chewing 
gum (Eychner et al., 2017). The Chelex-100 technique is a low-cost, hands-on alterna- 
tive method, but it, like the PCIA method, requires an overnight incubation step that 
must be accommodated in the lab’s standard operating procedure. 

The majority of contemporary DNA extraction techniques used in commercial kits 
are silica-based. Magnetic bead suspensions are used in the Qiagen QIAamp DNA Inves- 
tigator and EZ1 DNA Investigator kits. Magnetic beads and magnetite-modified silicon 
oxide magnetic beads are used in other techniques, such as the Applied Biosystems Prep- 
Filer and the Promega DNA IQ kits. Recent research found that the PrepFiler BTA 
approach outperformed a PCI-silica-based method for extracting DNA from bone 
(Hasap et al., 2019). These procedures are quick, can be automated, and produce the 
most repeatable DNA yields. Available robotic equipment includes the Beckman Coulter Biomek 
liquid handling platform, Tecan HID EVOlution, Promega Maxwell and Maxprep instruments, and 
Qiagen QIAcube HT, QIAcube Connect, and EZ1 instruments. 
In a recent investigation, certain crude bone lysates were shown to hinder DNA 
purification when utilizing paramagnetic silica beads in the DNA IQ Casework Pro 
Kit and the Maxwell 16 owing to filter clogging, although this was improved when 
the lysates were treated with phenol (Desmyter et al., 2017). The Qiagen EZ1 robot 
can be used in conjunction with the EZ1 DNA Investigator Kit. Samples collected on 
FTA cards or similar preservative-containing fibrous material can be extracted using 
one of the aforementioned kits or techniques, such as the EZ1 DNA Investigator Kit, 
or put straight into the polymerase chain reaction (PCR) sample tube in library creation 
(Kampmann et al., 2016). A modified process using the Qiagen EZ1 DNA Investigator 
Kit, silica-based technology, and the EZ1 BioRobot has been described for extracting DNA 
from contemporary and historic bone for use in DNA typing analysis (Dukes et al., 2012; 
Klavens et al., 2021). Given that contemporary human DNA typing kits generally require 
1 ng of input DNA, any of the extraction procedures should be able to recover enough 
DNA from blood or buccal samples. 

For DNA from chemically treated or degraded bone samples, the Armed Forces 
Medical Examiner System’s Armed Forces DNA Identification Laboratory has developed 
a protocol using demineralized samples (Marshall et al., 2017). Briefly, DNA was isolated 
from 0.2 to 1.0 g powdered bone sample. The bone powder was first demineralized in a 
buffer containing 0.5 M EDTA, 1% sarkosyl, and 20 mg/mL proteinase K by incubating 
overnight with agitation at 56°C. DNA was purified using one of three protocols: 
organic, QIAquick PCR Purification Kit (QIAGEN, Hilden, Germany), or a variation 
of the procedure that utilized the MinElute PCR Purification Kit (QIAGEN) instead of 
the QIAquick kit (Loreille et al., 2007, 2010). 

When four DNA extraction techniques were assessed (Edson, 2019), it was revealed 
that the best DNA recovery from postcranial osseous human remains of service members 
lost in World War II, the Korean War, and Southeast Asia was achieved by employing 
a comprehensive demineralization approach with organic extraction. Zeng et al. (2019) 
investigated the DNA IQ, DNA Investigator, and PrepFiler BTA kits, as well as the organic 
extraction method for hair and blood and two total demineralization protocols for bone; 
they discovered that all of the DNA extraction methods were efficient and compatible 
with the Precision ID and ForenSeq kits. Carrasco et al. (2020) described a customized 
extraction procedure for samples of deteriorated blood and dental remnants. Briefly, 
genomic DNA was extracted from dental tissue samples using the one-step, 1-h QuickEx- 
tract FFPE DNA Extraction Kit (Lucigen) according to the manufacturer’s instructions. A 
blood sample was extracted using organic methods (Proteinase K (ProK) and SDS digestion 
with phenol-chloroform extraction) and purified by cold ethanol precipitation followed by 
resuspension in Tris-EDTA buffer. Degraded blood samples were created by subjecting the 
extracted DNA to sonication at 50°C for time spans ranging from 0 to 16 h. Sidstedt et al. 
(2020) investigated PCR inhibition in massively parallel sequencing (MPS) applications. 
The authors note that although the classical solution to PCR inhibition is to purify 
or dilute DNA extracts, which leads to DNA loss, inhibitor-tolerant DNA polymerases, 
either single enzymes or blends, provide a more straightforward solution. 

DNA recovery from trace materials such as hair, fingernails, and fingerprints, as well as 
other contact DNA, has been studied for MPS applications (Naue et al., 2020). Preuner 
et al. (2014) discovered that using the PrepFiler Forensic DNA Extraction Kit resulted in 
high-quality DNA extracted from fingernail clippings. Tasker et al. (2017) recovered 
DNA from detonated improvised explosive device (IED) pipe bombs 
using cotton swabs treated with 2% sodium dodecyl sulfate and the QIAamp DNA 
Mini Kit (Qiagen Inc., Hilden, Germany). England et al. (2020) used DNA isolated 
from laser-microdissected cells to evaluate the ForenSeq DNA Signature Prep Kit. Eychner 
et al. (2017) evaluated five DNA extraction techniques for recovering DNA from 
chewing gum and saliva aliquoted on swabs (PCIA, QIAamp DNA Investigator, 
DNA IQ, Chelex-100, and PrepFiler); the QIAamp method performed the best overall. 
To concentrate the recovered DNA from low-quantity samples such as chewing gum, 
touch DNA, and human remains, ethanol precipitation, concentrator devices, and lower 
elution volumes can be used (Eychner et al., 2017). 

For extracting DNA from soil samples, equal amounts of soil were mixed with freshly 
prepared (to avoid bacterial contamination) saturated phosphate buffer to remove any PCR 
inhibitors (mainly humic acid). The mixture was then subjected to centrifugation. The 
supernatant containing the extracellular DNA was then extracted using a commercial kit for soil 
DNA (NucleoSpin Soil; Macherey-Nagel, Düren, Germany) following the manufacturer’s 
instructions (Giampaoli et al., 2014). In a pilot study for the validation of the Precision 
ID GlobalFiler NGS STR panel v2 with the Ion S5 system, DNA from samples 
(blood stain, nail, hair, and buccal swab) was extracted using the DNA extraction kit 
from Finegene Biotech, Co., Ltd., China (Tao et al., 2019). The DNA from muscle sam- 
ples was extracted using the DNeasy Blood and Tissue kit (Qiagen). 


DNA quantitation 


There are numerous DNA quantification methods available for determining the amount 
of DNA recovered and how much of the extract to use for NGS typing. Human-specific 
and nonspecific quantification techniques are available. UV-Vis and fluorescence spec- 
troscopy are two spectroscopic approaches for measuring DNA. Spectroscopic tech- 
niques for estimating total DNA in a sample can be utilized, although they are not 
human-specific (Elkins, 2013). Because they need as little as 1 µL of material, the NanoDrop 
(UV-Vis spectroscopy) and Qubit (fluorescence spectroscopy) instruments 
are commonly used for fast DNA measurement. Real-time PCR techniques can be used 
to determine the amount and amplifiability of human DNA (Horsman et al., 2006). 
Commercial assays as well as “home brew” tests can be used. Total human and male 
genomic DNA can be quantified using a real-time PCR technique targeting the TPOX region 
(Horsman et al., 2006). Several commercial kits are available that measure the amounts 
of total human and male genomic DNA in a sample and 
identify PCR inhibition using real-time PCR and multiple-dye-channel fluorescence 
detection. These include the Applied Biosystems Quantifiler Human DNA 
Quantification, Quantifiler DUO, and Quantifiler Trio kits (Barbisin et al., 2009; Green 
et al., 2005), as well as the Promega Plexor HY and PowerQuant System kits (Krenke 
et al., 2008). The Qiagen Investigator Quantiplex Pro and HYres Kits measure the total 
amount of human and male DNA in a sample, identify PCR inhibition, and offer a 
degradation index (Morrison et al., 2020; Vranes et al., 2017). While the Quantifiler 
and Plexor HY kits take around 2 h to complete, the Quantiplex HYres Kit takes just 
1 h. The sensitivity of commercial assays has increased over time, and the most recent 
kit, the Quantiplex HYres Kit, is the most sensitive. The Quantiplex HYres Kit is sensitive 
to 1 pg/µL of total DNA, whereas the Plexor HY kit is sensitive to 6.4 pg. The Quantifiler 
DUO kit detects 51 pg of DNA. For use with the PowerSeq 46GY NGS kit, the 
PowerSeq Quant MS System, QuantiFluor ONE dsDNA System, and QuantiFluor 
dsDNA System are suggested. The more degraded the DNA, the fewer loci can be 
expected to be typed using any DNA typing method. A precise measurement of the 
amount of human DNA in a sample is required for estimating the proper extract input 
for NGS. The first step in NGS is to prepare a library. The extracted DNA can be 
diluted to obtain the ideal input quantity specified by the manufacturer of the NGS 
kit. For low-template samples, input can be increased by adding more extract 
and no water to the library preparation PCR reactions, but if sufficient DNA is not present 
in the reaction, a lower number of reads and reduced coverage may result. Pajnic et al. (2020) used 
the PowerQuant kit to quantify DNA from human remains from a World War II massacre of a 
Slovenian family, producing DNA profiles from a molar and five femurs. 
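
The dilution logic described above can be sketched in a few lines of Python. The 1 ng target reflects the typical input mentioned earlier for human DNA typing kits; the 5 µL maximum template volume and the function name are assumptions made for illustration only, and the actual values must be taken from the instructions of the kit in use.

```python
def plan_library_input(conc_ng_per_ul, target_ng=1.0, max_template_ul=5.0):
    """Return (extract_ul, water_ul, actual_input_ng) for one library prep reaction.

    target_ng and max_template_ul are illustrative defaults, not kit specifications.
    """
    if conc_ng_per_ul <= 0:
        raise ValueError("concentration must be positive")
    needed_ul = target_ng / conc_ng_per_ul
    # Low-template samples: use the full template volume and add no water.
    extract_ul = min(needed_ul, max_template_ul)
    water_ul = max_template_ul - extract_ul
    return extract_ul, water_ul, extract_ul * conc_ng_per_ul


# A reference-type extract at 0.5 ng/uL reaches the target with added water.
print(plan_library_input(0.5))
# A low-template extract at 0.05 ng/uL cannot reach the target input.
print(plan_library_input(0.05))
```

When the returned input falls below the target, fewer reads and lower coverage should be anticipated, as noted in the text.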

In a study where the beta version of the ForenSeq DNA Signature Prep Kit MiSeq 
FGx system was evaluated, casework-type and ancient DNA samples were extracted 
and then quantitated with a TaqMan assay targeting an Alu repetition motif of 79 bp 
(Niederstatter et al., 2007; Xavier et al., 2017). For the developmental validation of 
the MiSeq FGx forensic genomics system, buccal cell samples were quantitated using 
the Quantifiler Human DNA quantification kit (Life Technologies, Carlsbad, CA) 
on the Stratagene Mx3000P qPCR System (Agilent, Santa Clara, CA) or using the Qubit 
dsDNA HS or BR assay kit on a Qubit 2.0 Fluorometer (Life Technologies), according 
to the respective manufacturers’ instructions (Jager et al., 2017). In another study, the 
performance of an MPS-STR panel (Precision ID GlobalFiler NGS STR Panel; 
Thermo Fisher Scientific) designed for MPS on the Ion Torrent platform was evaluated 
using human blood samples that were extracted using the PureLink Genomic DNA 
Mini kit (Thermo Fisher Scientific), followed by quantitation using the Quantifiler Hu- 
man DNA quantification kit on an Applied Biosystems 7500 Real-Time PCR system 
following the manufacturer’s instructions (Wang et al., 2017). 


Library preparation 


Library preparation is a set of processes used to prepare a sample for NGS sequencing. It 
starts with previously extracted, quantitated, and diluted samples from various sources 
mentioned above. The number of steps to prepare the library varies substantially depend- 
ing on the manufacturer and procedure. NGS kits are available from Thermo Fisher, 
Verogen, Promega, and Qiagen. When utilizing kits from Thermo Fisher Scientific, 
there are three panels for analyzing STR and SNP markers. These include the Precision 
ID GlobalFiler NGS STR panel, the Precision ID Ancestry panel, and the Precision ID 
Identity panel. The Precision ID GlobalFiler NGS STR Panel includes primer sets for 35 
markers, including the same 21 autosomal STR markers, the amelogenin sex markers, 
and 14 additional informative markers for forensic analysis (Muller et al., 2018; Wang 
et al., 2017). The Precision ID Ancestry Panel from Applied Biosystems consists of 
165 autosomal markers (that provide biogeographic ancestry information), including 
55 markers created by Dr. Kenneth Kidd (Kidd et al., 2012) and 123 markers based 
on a publication by Dr. Michael Seldin (Kosoy et al., 2009). The Precision ID Identity 
Panel consists of 124 autosomal SNPs with high heterozygosity and a low fixation index 
(Fst) (Pakstis et al., 2010). 

The MiSeq FGx Forensic Genomics System, comprising the ForenSeq DNA Signature 
Prep Kit, MiSeq FGx Reagent Kit, MiSeq FGx instrument, and ForenSeq Universal 
Analysis Software, uses PCR to simultaneously amplify up to 231 forensic loci in a single 
multiplex reaction (Jager et al., 2017). Targeted loci include amelogenin, 27 common 
forensic autosomal STRs, 24 Y-STRs, 7 X-STRs, and 3 classes of SNPs. The ForenSeq 
kit includes two primer sets: amelogenin, 58 STRs, and 94 identity-informative SNPs 
(iiSNPs) are amplified using DNA Primer Set A (DPMA; 153 loci); if a laboratory chooses 
to generate investigative leads using DNA Primer Set B, amplification is targeted to 
the 153 loci in DPMA plus 22 phenotypic-informative SNPs (piSNPs) and 56 biogeographical 
ancestry-informative SNPs (aiSNPs). 
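
The locus counts quoted above can be checked with simple arithmetic. The short Python sketch below only restates the numbers given in the text; the dictionary keys are labels chosen here for readability.

```python
# Loci amplified by DNA Primer Set A, as listed in the text.
primer_set_a = {
    "amelogenin": 1,
    "autosomal_STR": 27,
    "Y_STR": 24,
    "X_STR": 7,
    "identity_SNP": 94,
}
# Additional SNP classes targeted only by DNA Primer Set B.
primer_set_b_extra = {"phenotypic_SNP": 22, "ancestry_SNP": 56}

str_total = primer_set_a["autosomal_STR"] + primer_set_a["Y_STR"] + primer_set_a["X_STR"]
loci_a = sum(primer_set_a.values())
loci_b = loci_a + sum(primer_set_b_extra.values())

print("STRs:", str_total)
print("Primer Set A loci:", loci_a)
print("Primer Set B loci:", loci_b)
```

The totals agree with the 58 STRs, 153 loci, and 231 loci stated for the kit.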

The ForenSeq DNA Signature Prep kit is supplied by Verogen. The ForenSeq MainstAY kit 
from Verogen contains 53 standard loci for use in forensic analysis, relationship testing, 
and research. The kit combines 27 autosomal STRs, 25 Y-STRs, and amelogenin in a single 
reaction to provide higher likelihood ratios and enable direct comparison of single-source 
samples. 

The Promega PowerSeq 46GY kit is a 46-locus multiplex system that targets 22 auto- 
somal STR markers, 23 Y-STRs, and amelogenin. The extended panel of STR markers 
is intended to satisfy both the Combined DNA Index System and the European Standard 
Set recommendations. 

While many NGS applications require WGS, forensic genomic applications use a 
tailored sequencing technique that uses an amplicon-based workflow to target forensi- 
cally relevant sequences. The library preparation begins with amplifying and enriching 
the targets, followed by the addition of tags (tagmentation), indices for demultiplexing, 
and adapter sequences for flow cell binding, and finally library purification and 
normalization. During the library preparation process, in addition to the samples to be 
sequenced, a positive control (e.g., 2800M) and a negative control (no-template control) 
are processed. The ForenSeq kit requires a two-step amplification to prepare the library. 
For each forensically important target sequence in the DNA sample, the first PCR step 
uses sequence-specific, tagged primer pairs. Indexes and adapters are integrated into the 
amplicons during the second PCR step. After that, the amplicon libraries are purified, 
pooled, and linearized in a single tube. During the first PCR step, a forward tag con- 
nected to the forward primer and a reverse tag linked to the reverse primer are added 
to the target amplicon. All of the targets have the same sequence tags appended to 
them. Using the tags added in the first step of PCR, adapter sequences are added adjacent 
to the primer sequences to allow the amplicons to bind to the flow cell for sequencing. 
The unique forward and reverse adapter i7 and i5 index sequence combinations are 
added to the targets to label them for demultiplexing in the second PCR 
step. The library is injected into a flow cell, where fragments are captured on a lawn 
of surface-bound oligos complementary to the library adapters. The indices are a distinct 
collection of sequences used to allocate data to samples. An index, which is similar to a 
barcode, is made up of an 8-bp sequence that is used to demultiplex the sample data once 
it has been sequenced. As a result, the completed “library” for each target includes an i5 
adapter, an i5 index, a forward tag, a target sequence, a reverse tag, an i7 index, and an i7 
adapter. The ForenSeq first and second PCR steps produce amplicons ranging in size 
from 60 to 460 bp. The French National Police used a Hamilton ID NGS-V STARlet 
robot to automate the library preparation procedure (Laurent et al., 2017). 
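
The role of the index pair in demultiplexing can be illustrated with a minimal Python sketch. The 8-bp index sequences, the sample names, and the read tuples are hypothetical and serve only as an example; actual index sets are defined by the kit documentation, and production software also handles sequencing errors in the index reads, which this sketch does not.

```python
from collections import defaultdict

# Hypothetical sample sheet: (i7 index, i5 index) -> sample name.
SAMPLE_SHEET = {
    ("ATCACGTT", "TATAGCCT"): "sample_1",
    ("CGATGTTT", "ATAGAGGC"): "sample_2",
}


def demultiplex(reads, sample_sheet):
    """Assign each read to a sample by exact match of its i7/i5 index pair."""
    bins = defaultdict(list)
    for i7, i5, sequence in reads:
        sample = sample_sheet.get((i7, i5), "undetermined")
        bins[sample].append(sequence)
    return dict(bins)


# Each read is given as (i7 index, i5 index, target sequence).
reads = [
    ("ATCACGTT", "TATAGCCT", "TCTATCTATCTA"),
    ("CGATGTTT", "ATAGAGGC", "TCTATCTGTCTA"),
    ("ATCACGTT", "ATAGAGGC", "TCTATCTATCTA"),  # index pair not in the sheet
]
for sample, sequences in demultiplex(reads, SAMPLE_SHEET).items():
    print(sample, len(sequences))
```

Using a pair of indices rather than a single index means that a read carrying an unexpected combination can be set aside instead of being assigned to the wrong sample.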

The Thermo Fisher Ion Chef robot can be used to perform NGS library preparation 
for any of the human identification (HID) Ion AmpliSeq panels, including the Global- 
Filer NGS STR Panel v2, Identity Panel, AmpliSeq Identity Panel, and the Ampliseq 
Ancestry Panel. When employing the robot, the plate containing the eight samples to 
be sequenced, consumables, and the master mix are all loaded into the Ion Chef. Consumables 
include Ion S5 Precision ID Chef Solutions reagent cartridges, a chip adapter, 
an enrichment strip cartridge, a tip cartridge with pipet tips, a PCR plate and frame seal, a 
disposable lid for the recovery station, recovery tubes, and one or two sequencing chips. 
The barcodes of the objects placed in the instrument are scanned by a camera system. The 
Ion Chef contains a thermocycler that automates the library preparation, library purification, 
library normalization, and chip loading procedures for the eight samples in a 7-h 
process. Using a combination of restriction enzymes, each DNA sample is sliced into 
millions of pieces. The Ion Xpress Barcode Adapters 1-96 Kit and the IonCode Barcode 
Adapters 1-384 Kit are used for barcoding. Barcoded libraries may be pooled and 
loaded onto a single Ion chip, reducing sequencing run time and cost while still allowing 
for reliable sample-to-sample comparisons. Each fragment binds to its own primer-coated 
bead, giving rise to template ion sphere particles (ISPs). The ISPs are washed, 
and those that are template-positive are enriched at the enrichment 
station and loaded onto the chip. Each bead falls into a well after flowing across the chip. 
When employing the Ion Chef with a 530 chip, 24 samples may be loaded on one chip in 
around 10 h for autosomal NGS library creation. 

To summarize, a library is a DNA sample that has indexes and adapters attached to 
both ends and is ready for sequencing. Following library preparation, the libraries are 
pooled in a single tube to create a multiplex of different libraries 
ready for sequencing. 


Library normalization 


The library purification process comes after library preparation. Excess primers and
reagents are removed during the library purification, or clean-up, process. The ForenSeq
Signature Prep Kit makes use of magnetic beads. Working with fewer samples at a
time during the bead-based steps gives better sequencing results. In this method, the
library is first added to a consistent number of magnetic beads, and the DNA
binds to the beads. A magnetic block at the bottom of the plate is used to draw the
DNA-bound beads toward the magnet. Ethanol washes clean the beads and remove
any leftover primers or PCR reagents. After the residual ethanol is removed, the DNA
library is released from the beads using a resuspension buffer. Users must take care to
ensure that all of the ethanol is removed. The purified product contains more of each
library than will be required for sequencing. Prior to
sequencing, the amplicon lengths and quality can be checked using an agarose or
polyacrylamide gel, a BioAnalyzer, or QIAxcel equipment. High-quality ForenSeq libraries
have amplicons in the 60-460 bp range. Primer dimers appear as short amplicons
of about 60 bp.

As previously stated, the Ion Chef handles not only library preparation but also library
quantification, purification, library normalization, and chip loading. The library
normalization phase comes after library purification. The Ion Library TaqMan Quantitation Kit
may be used to quantify the eight-sample library. Because the PCR processes in library
preparation can give a wide range of yields, the library normalization step is used to
equalize the quantity and concentration of each sample library so that each
library is represented equally when pooled. Once it has been decided which samples
will be sequenced in the same run, the generated libraries must be normalized and pooled
together for the sequencing run. Libraries generated at different times can be normalized
together and sequenced on the same flow cell if they contain unique index
combinations. The pooled libraries are diluted to 50 pM and combined in the order of the 530
chip’s barcode adapters (1-32).
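
The dilution to 50 pM follows the usual C1·V1 = C2·V2 relationship. The sketch below is a minimal illustration of that arithmetic; the function name, the 25 µL default final volume, and the example stock concentration are assumptions made for the example and are not values from any user guide.

```python
def dilution_volumes(stock_pm, target_pm=50.0, final_ul=25.0):
    """Return (library_ul, diluent_ul) giving target_pm in final_ul (C1*V1 = C2*V2)."""
    if stock_pm < target_pm:
        # A library below the target cannot be diluted up to it.
        raise ValueError("stock is already below the target concentration")
    library_ul = target_pm * final_ul / stock_pm
    return library_ul, final_ul - library_ul

# Hypothetical library quantified at 200 pM, diluted to 50 pM in 20 uL.
library_ul, diluent_ul = dilution_volumes(200.0, target_pm=50.0, final_ul=20.0)
print(f"{library_ul:.1f} uL library + {diluent_ul:.1f} uL diluent")
```

Once every library is at the same concentration, pooling equal volumes gives each library equal representation in the run.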

Bead-based normalization is used in the ForenSeq Signature Prep Kit. The normalizing
procedure employs a consistent number of magnetic beads to bind the same amount
of library from each sample. Before use, the beads must be warmed to room temperature
and mixed well to guarantee an even distribution of beads to each library. Beads are
added to each well containing a library and bind an equal, maximum amount of
each library, determined by the binding capacity and number of beads added. The
magnetic beads bind the DNA libraries, the surplus is removed, and the normalized
libraries are then eluted from the beads.
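
Why saturating beads equalize libraries can be seen in a deliberately simplified model: each well receives beads with the same binding capacity, so any library whose yield exceeds that capacity is capped at it. The capacity and yields below are invented numbers, and the hard cap is an idealization of real binding behavior.

```python
def bead_normalize(library_yields_ng, bead_capacity_ng):
    """Idealized model: beads retain at most bead_capacity_ng of each library."""
    return [min(yield_ng, bead_capacity_ng) for yield_ng in library_yields_ng]

# Hypothetical yields from three libraries with very different PCR efficiency.
print(bead_normalize([120.0, 45.0, 300.0], bead_capacity_ng=20.0))
# A library that yields less than the bead capacity stays under-represented.
print(bead_normalize([10.0, 50.0], bead_capacity_ng=20.0))
```

The second call shows the limit of the approach: normalization only equalizes libraries that contain at least as much DNA as the beads can bind.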

According to Verogen, the library preparation stage using the ForenSeq Signature
Prep Kit takes around 9 h. In practice, the hands-on time varies with the number
of samples and the investigator’s experience with the library preparation procedures
and methods.


References 


Barbisin, M., Fang, R., O’Shea, C. E., Calandro, L. M., Furtado, M. R., & Shewale, J. G. (2009).
Developmental validation of the Quantifiler® Duo DNA Quantification Kit for simultaneous quantification of
total human and human male DNA and detection of PCR inhibitors in biological samples. Journal of
Forensic Sciences, 54(2), 305-319. https://doi.org/10.1111/j.1556-4029.2008.00951.x

Behjati, S., & Tarpey, P. S. (2013). What is next generation sequencing? Archives of Disease in Childhood.
Education and Practice Edition, 98(6), 236-238. https://doi.org/10.1136/archdischild-2013-304340

Brevnov, M. G., Pawar, H. S., Mundt, J., Calandro, L. M., Furtado, M. R., & Shewale, J. G. (2009).
Developmental validation of the PrepFiler™ Forensic DNA Extraction Kit for extraction of genomic DNA from
biological samples. Journal of Forensic Sciences, 54(3), 599-607.
https://doi.org/10.1111/j.1556-4029.2009.01013.x

Butler, J. M. (2005). Forensic DNA typing: Biology, technology, and genetics of STR markers. Elsevier. 

Carrasco, P., Inostroza, C., Didier, M., Godoy, M., Holt, C. L., Tabak, J., & Loftus, A. (2020). Optimizing
DNA recovery and forensic typing of degraded blood and dental remains using a specialized extraction
method, comprehensive qPCR sample characterization, and massively parallel sequencing. International
Journal of Legal Medicine, 134(1), 79-91. https://doi.org/10.1007/s00414-019-02124-y

Castella, V., Dimo-Simonin, N., Brandt-Casadevall, C., & Mangin, P. (2006). Forensic evaluation of the
QIAshredder/QIAamp DNA extraction procedure. Forensic Science International, 156(1), 70-73.
https://doi.org/10.1016/j.forsciint.2005.11.012

Desmyter, S., De Cock, G., Moulin, S., & Noël, F. (2017). Organic extraction of bone lysates improves
DNA purification with silica beads. Forensic Science International, 273, 96-101.
https://doi.org/10.1016/j.forsciint.2017.02.003

Dukes, M. J., Williams, A. L., Massey, C. M., & Wojtkiewicz, P. W. (2012). Technical note: Bone DNA
extraction and purification using silica-coated paramagnetic beads. American Journal of Physical
Anthropology, 148(3), 473-482. https://doi.org/10.1002/ajpa.22057

Edson, S. M. (2019). Extraction of DNA from skeletonized postcranial remains: A discussion of protocols and
testing modalities. Journal of Forensic Sciences, 64(5), 1312-1323.
https://doi.org/10.1111/1556-4029.14050

Elkins, K. M. (2013). Forensic DNA biology: A laboratory manual. Academic Press. 

Elkins, K. M., & Zeller, C. B. (2021). Next generation sequencing in forensic science: A primer. CRC Press.
https://doi.org/10.4324/9781003196464

England, R., Nancollis, G., Stacey, J., Sarman, A., Min, J., & Harbison, S. (2020). Compatibility of the
ForenSeq™ DNA Signature Prep Kit with laser microdissected cells: An exploration of issues that arise with
samples containing low cell numbers. Forensic Science International: Genetics, 47, 102278.
https://doi.org/10.1016/j.fsigen.2020.102278

Eychner, A. M., Schott, K. M., & Elkins, K. M. (2017). Assessing DNA recovery from chewing gum.
Medicine, Science & the Law, 57(1), 7-11. https://doi.org/10.1177/0025802416676413




Gaudio, D., Fernandes, D. M., Schmidt, R., Cheronet, O., Mazzarelli, D., Mattia, M., O’Keefe, T.,
Feeney, R. N. M., Cattaneo, C., & Pinhasi, R. (2019). Genome-wide DNA from degraded petrous
bones and the assessment of sex and probable geographic origins of forensic cases. Nature Scientific Reports,
9(8226), 1-11. https://doi.org/10.1038/s41598-019-44638-w

Giampaoli, S., Berti, A., DiMaggio, R. M., Pili, E., Valentini, A., Valeriani, F., Gianfranceschi, G., Barni, F.,
Rupani, L., & Romano Spica, V. (2014). The environmental biological signature: NGS profiling for
forensic comparison of soils. Forensic Science International, 240, 41-47.

Green, R. L., Roinestad, I. C., Boland, C., & Hennessy, L. K. (2005). Developmental validation of the
Quantifiler™ real-time PCR kits for the quantification of human nuclear DNA samples. Journal of Forensic
Sciences, 50(4), 809-825.

Hasap, L., Chotigeat, W., Pradutkanchana, J., Asawutmangkul, W., Kitpipit, T., & Thanakiatkrai, P. (2019).
Comparison of two DNA extraction methods: PrepFiler® BTA and modified PCI-silica based for DNA
analysis from bone. Forensic Science International: Genetics Supplement Series, 7(1), 669-670.
https://doi.org/10.1016/j.fsigss.2019.10.132

Hoff-Olsen, P., Mevag, B., Staalstrom, E., Hovde, B., Egeland, T., & Olaisen, B. (1999). Extraction of DNA
from decomposed human tissue: An evaluation of five extraction methods for short tandem repeat
typing. Forensic Science International, 105(3), 171-183.
https://doi.org/10.1016/S0379-0738(99)00128-0

Horsman, K. M., Hickey, J. A., Cotton, R. W., Landers, J. P., & Maddox, L. O. (2006). Development of a
human-specific real-time PCR assay for the simultaneous quantitation of total genomic and male
DNA. Journal of Forensic Sciences, 51(4), 758-765. https://doi.org/10.1111/j.1556-4029.2006.00183.x

Jager, A. C., Alvarez, M. L., Davis, C. P., Guzman, E., Han, Y., Way, L., Walichiewicz, P., Silva, D.,
Pham, N., Caves, G., Bruand, J., Schlesinger, F., Pond, S. J. K., Varlaro, J., Stephens, K. M., &
Holt, C. L. (2017). Developmental validation of the MiSeq FGx forensic genomics system for targeted
next generation sequencing in forensic DNA casework and database laboratories. Forensic Science
International: Genetics, 28, 52-70. https://doi.org/10.1016/j.fsigen.2017.01.011

Kampmann, M.-L., Buchard, A., Borsting, C., & Morling, N. (2016). High-throughput sequencing of 
forensic genetic samples using punches of FTA cards with buccal swabs. Biotechniques, 61(3), 
149-151. https://doi.org/10.2144/000114453 

Kidd, K. (2012). Expanding data and resources for forensic use of SNPs in individual identification. Forensic
Science International: Genetics, 6(5), 646-652.

Klavens, A., Kollmann, D. D., Elkins, K. M., & Zeller, C. B. (2021). Comparison of DNA yield and STR
profiles from the diaphysis, mid-diaphysis, and metaphysis regions of femur and tibia long bones. Journal
of Forensic Sciences, 66(3), 1104-1113. https://doi.org/10.1111/1556-4029.14657

Kosoy, R., Nassir, R., & Tian, C. (2009). Ancestry informative marker sets for determining continental 
origin and admixtures proportions in common populations in America. Human Mutation, 30(1), 69-78. 

Krenke, B. E., Nassif, N., Sprecher, C. J., Knox, C., Schwandt, M., & Storts, D. R. (2008). Developmental
validation of a real-time PCR assay for the simultaneous quantification of total human and male DNA.
Forensic Science International: Genetics, 3(1), 14-21. https://doi.org/10.1016/j.fsigen.2008.07.004

Laurent, F. X., Ausset, L., Clot, M., Jullien, S., Chantrel, Y., Hollard, C., & Pene, L. (2017). Automation of
library preparation using Illumina ForenSeq kit for routine sequencing of casework samples. Forensic
Science International: Genetics Supplement Series, 6, e415-e417. https://doi.org/10.1016/j.fsigss.2017.09.156

Loreille, O. M., Diegoli, T. M., Irwin, J. A., Coble, M. D., & Parsons, T. J. (2007). High efficiency DNA 
extraction from bone by total demineralization. Forensic Science International, 1, 191-195. 

Loreille, O. M., Parr, R. L., McGregor, K. A., Fitzpatrick, C. M., Lyons, C., Yang, D. Y., Speller, C. F., &
Grimm, M. R. (2010). Integrated DNA and fingerprint analyses in the identification of 60-year old
mummified human remains discovered in an Alaskan Glacier. Journal of Forensic Sciences, 55, 813-818.

Marshall, C., Sturk-Andreaggi, K., Daniels-Higginbotham, J., Oliver, R. S., Barritt-Ross, S., &
McMahon, T. P. (2017). Performance evaluation of a mitogenome capture and Illumina sequencing
protocol using non-probative, case-type skeletal samples: Implications for the use of a positive control
in a next-generation sequencing procedure. Forensic Science International: Genetics, 31, 198-206.




Morrison, J., McColl, S., Louhelainen, J., Sheppard, K., May, A., Girdland-Flink, L., Watts, G., &
Dawnay, N. (2020). Assessing the performance of quantity and quality metrics using the QIAGEN
Investigator® Quantiplex® Pro RGQ Kit. Science & Justice, 60(4), 388-397.
https://doi.org/10.1016/j.scijus.2020.03.002

Muller, P., Alonso, A., Barrio, P. A., Berger, B., Bodner, M., Martin, P., & Parsons, W. (2018). The DNA-
SEQEX consortium. Systematic evaluation of the early access Applied Biosystems Precision ID GlobalFiler
Mixture ID and GlobalFiler NGS STR panels for the Ion S5 system. Forensic Science International: Genetics,
36, 95-103.

Naue, J., Sanger, T., & Lutz-Bonengel, S. (2020). Get it off, but keep it: Efficient cleaning of hair shafts with
parallel DNA extraction of the surface stain. Forensic Science International: Genetics, 45, 102210.
https://doi.org/10.1016/j.fsigen.2019.102210

Niederstatter, H., Kochl, S., Grubwieser, P., Pavlic, M., Steinlechner, M., & Parson, W. (2007). A modular
real-time PCR concept for determining the quantity and quality of human nuclear and mitochondrial
DNA. Forensic Science International: Genetics, 1, 29-34.

Pajnic, Z., Obal, M. I., & Zupanc, T. (2020). Identifying victims of the largest Second World War family
massacre in Slovenia. Forensic Science International, 306, 110056.
https://doi.org/10.1016/j.forsciint.2019.110056

Pakstis, A. J., Speed, W. C., Fang, R., Hyland, F. C. L., Furtado, M. H., Kidd, J. R., & Kidd, K. K. (2010).
SNPs for a universal individual identification panel. Human Genetics, 127, 315-324.

Preuner, S., Danzer, M., Pröll, J., Potschger, U., Lawitschka, A., Gabriel, C., & Lion, T. (2014).
High-quality DNA from fingernails for genetic analysis. Journal of Molecular Diagnostics, 16(4), 459-466.
https://doi.org/10.1016/j.jmoldx.2014.02.004

Sidstedt, M., Radstrom, P., & Hedman, J. (2020). PCR inhibition in QPCR, DPCR and MPS— 
Mechanisms and solutions. Analytical and Bioanalytical Chemistry, 412(9), 2009-2023. https://doi.org/ 
10.1007/s00216-020-02490-2 

Tao, R., Wenjie, Q., Chen, C., Zhang, J., Yang, Z., Song, W., Zhang, S., & Li, C. (2019). Pilot study for
forensic evaluations of the Precision ID GlobalFiler™ NGS STR panel with the Ion S5™ system. Forensic
Science International, 43, 102147.

Tasker, E., LaRue, B., Beherec, C., Gangitano, D., & Hughes-Stamm, S. (2017). Analysis of DNA from
post-blast pipe bomb fragments for identification and determination of ancestry. Forensic Science
International: Genetics, 28, 195-202. https://doi.org/10.1016/j.fsigen.2017.02.016

van Dijk, E. L., Auger, H., Jaszczyszyn, & Thermes, C. (2014). Ten years of next-generation sequencing
technology. Trends in Genetics, 30(9), 418-426.

Van Neste, C., Vandewoestyne, M., Van Criekinge, W., Deforce, D., & Nieuwerburghm, F. V. (2014). 
My-forensic-loci-queries (MyFLq) framework for analysis of forensic STR data generated by massive 
parallel sequencing. Forensic Science International, 9, 1—8. 

Vranes, M., Scherer, M., & Elliott, K. (2017). Development and validation of the Investigator® Quantiplex
Pro Kit for qPCR-based examination of the quantity and quality of human DNA in forensic samples.
Forensic Science International: Genetics Supplement Series, 6, e518-e519.
https://doi.org/10.1016/j.fsigss.2017.09.207

Walsh, P. S., Metzger, D. A., & Higuchi, R. (1991). Chelex 100 as a medium for simple extraction of DNA
for PCR-based typing from forensic material. Biotechniques, 10(4), 506-513.

Wang, Z., Zhou, D., Wang, H., Jia, Z., Liu, J., Qian, X., Li, C., & Hou, Y. (2017). Massively parallel
sequencing of 32 forensic markers using the Precision ID GlobalFiler NGS STR panel and the Ion
PGM system. Forensic Science International: Genetics, 31, 126-134.

Weber-Lehmann, J., Schilling, E., Gradl, G., Richter, D. C., Wiehler, J., & Rolf, B. (2014). Finding the
needle in the haystack: Differentiating “identical” twins in paternity testing and forensics by
ultra-deep next generation sequencing. Forensic Science International, 9, 42-46.

Xavier, C., & Parson, W. (2017). Evaluation of the Illumina ForenSeq™ DNA Signature Prep Kit -
MPS forensic application for the MiSeq FGx benchtop sequencer. Forensic Science International,
188-194.




Yang, Y., Xie, B., & Yan, J. (2014). Application of next-generation sequencing technology in forensic
science. Genomics, Proteomics & Bioinformatics, 12(5), 190-197.
https://doi.org/10.1016/j.gpb.2014.09.001

Zeng, X., Elwick, K., Mayes, C., Takahashi, M., King, J. L., Gangitano, D., Budowle, B., &
Hughes-Stamm, S. (2019). Assessment of impact of DNA extraction methods on analysis of human remain
samples on massively parallel sequencing success. International Journal of Legal Medicine, 133(1), 51-58.
https://doi.org/10.1007/s00414-018-1955-9


CHAPTER 5 


Commercial kits commonly used for 
NGS based forensic DNA analysis 


Tugba Unsal Sapan 
Üsküdar University, Faculty of Engineering and Natural Science, Department of Forensic Sciences, Institute of Addiction and
Forensic Sciences, Istanbul, Turkey


Introduction 


DNA sequencing technology is applied to amplified DNA regions in forensic genetics. The
use of DNA markers for forensic purposes began with the discovery
of restriction fragment length polymorphism (RFLP) in the early 1980s and then gained
momentum with short tandem repeats (STRs). In the following years, additional approaches
such as mitochondrial DNA (mtDNA) sequencing were developed and used in identification studies
for forensic DNA analysis, especially for ancestry analysis.

In addition, because STR loci span relatively long stretches of the DNA sequence, they
require a larger amount of DNA and were insufficient to resolve some forensic cases.
To resolve these cases, single-nucleotide polymorphisms (SNPs), which occupy much
shorter stretches of the DNA sequence and can be typed from smaller amounts of DNA,
came into use.

Since SNPs are biallelic, the discriminatory power of a single SNP is low, and it is necessary to
study 50-60 SNPs to reach the discrimination power of 13 STRs. Because Sanger sequencing
is limited in multilocus studies, for example by not being able to analyze all loci
in one run, researchers have turned to next-generation sequencing
(NGS) technologies as an alternative in forensic DNA analysis. In forensic DNA analysis,
NGS technology has provided significant advantages for SNP or mtDNA analysis of
degraded samples, the analysis of challenging paternity cases, and new types of analysis
such as ancestry and phenotype determination. In addition, it has become the preferred
system in recent years because it allows many marker systems to be analyzed in a single workflow
(Borsting & Morling, 2015; Yang et al., 2014).
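
The order of magnitude behind the "50-60 SNPs" figure can be checked with a short calculation. The sketch below assumes independent (unlinked) SNPs in Hardy-Weinberg proportions and uses an illustrative target random match probability of 1e-15 for a 13-STR profile; both the target and the allele frequencies are assumptions chosen for the example, not values from the cited studies.

```python
import math

def snp_match_probability(p):
    """Chance that two unrelated people share a genotype at one biallelic SNP
    with allele frequencies p and 1 - p (Hardy-Weinberg proportions assumed)."""
    q = 1.0 - p
    genotype_freqs = (p * p, 2 * p * q, q * q)
    return sum(f * f for f in genotype_freqs)

def snps_needed(target_rmp, p):
    """Smallest number of independent SNPs whose combined match probability
    is at or below target_rmp."""
    return math.ceil(math.log(target_rmp) / math.log(snp_match_probability(p)))

# Illustrative target for a 13-STR profile (assumed, not measured).
TARGET_RMP = 1e-15
for allele_freq in (0.5, 0.2):
    print(allele_freq, snps_needed(TARGET_RMP, allele_freq))
```

Under these assumptions the most informative SNPs (allele frequency 0.5) need a few dozen loci, and less balanced SNPs need more, which is in line with the range quoted above.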


Development of next-generation sequencing systems 


NGS technology refers to non-Sanger, high-throughput DNA
sequencing technology. With this technology, millions or billions of DNA molecules
can be sequenced in parallel. Throughput therefore increases significantly, and
the need for the fragment analysis method frequently used in Sanger sequencing is minimized.
Second-generation sequencing technology based on cyclic array sequencing, which can
analyze large numbers of samples simultaneously, has been developed, as well as
third-generation sequencing technologies that can determine the base composition of
single DNA molecules (van Dijk et al., 2014).


Second-generation NGS systems 


NGS systems use several different sequencing methods:

• Pyrosequencing: Detects the pyrophosphate released when a nucleotide is incorporated.
Templates are clonally amplified by emulsion PCR on beads that carry attached oligonucleotides.

• Sequencing by synthesis: Follows the addition of
labeled nucleotides as the DNA strand is copied.

• Sequencing by ligation: Uses the DNA ligase
enzyme to identify nucleotides at a specific position in the DNA sequence.

• Semiconductor sequencing: Based on the detection
of hydrogen ions released during the polymerization of DNA.

• Shotgun method: Used for sequencing large genomes. The method involves breaking
the genome into a collection of small DNA fragments that are sequenced individually.
A computer program looks for overlaps in the DNA sequences and uses them to place
the individual fragments in their correct order to reconstitute the genome. This
method is the most common of the NGS methods. It is not used in
forensic analyses because it works with large genomes and is not suitable for pairwise
comparison.

• Targeted (re)sequencing method: Selected regions are enriched by PCR
amplification, by capture with probes, or with
both probes and enzymes, and are then sequenced. This method allows more samples to be
run simultaneously. Unlike the shotgun method, targeted (re)sequencing is suitable for use
in forensic sciences. With this method, identification, phenotype predictions, and tissue
typing can be done (Alvarez-Cubero et al., 2017; Borsting & Morling, 2015;
Daniel et al., 2015; Heather & Chain, 2016; Mamanova et al., 2010; Mardis, 2008).
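
The overlap step described for the shotgun method can be sketched as a toy greedy merge: repeatedly join the two fragments with the longest suffix-prefix overlap. This is only an illustration of the idea on invented, error-free fragments; real assemblers use graph-based methods and must handle sequencing errors, repeats, and both strands.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(fragments, min_len=3):
    """Merge fragments pairwise by best overlap until no overlaps remain."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # remaining fragments do not overlap; leave them as contigs
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags

# Hypothetical fragments covering one short region.
print(greedy_assemble(["GCGTACC", "ATGGCGT", "TACCTTA"]))
```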

The first development in the field of NGS was made by 454 Corporation in 2005.
The company developed the pyrosequencing method using the emulsion PCR technique
with oligonucleotides attached to clonal amplification beads and named its first system
“Genome Sequencer 20”. With this device, the human genome was sequenced
in 5 months. However, because the technology has advanced rapidly and
new methods and systems have been developed, this device is no longer produced.
The system was acquired by Roche in the following years and updated as the
“Roche GS FLX Titanium” platform. The 454 Genome Sequencer has approximately
2 million wells for analysis, each holding a 28-µm-diameter bead coated with
single-stranded copies of the target sequence. Current 454/Roche GS FLX Titanium
platforms can sequence DNA regions of 300-500 nt in length in approximately 1.5
million wells in a single analysis (Kircher & Kelso, 2010; McGinn & Glynne Gut, 2013).

The second NGS system was created by Solexa, which launched the “Genome
Analyzer” system in 2006. Solexa was acquired by Illumina in 2007, and Illumina later
introduced the “Illumina HiSeq 2500” system. With the Illumina HiSeq 2500, a human genome
could be sequenced at 30-fold coverage in about 1 day. Starting in 2011, Illumina introduced
benchtop systems including the iSeq 100 System, MiniSeq System, MiSeq System, NextSeq 550,
and NextSeq 1000-2000, which enable lower volumes of data to be studied in a shorter time.
The technique used in these systems is sequencing by synthesis.
Up to 400 million reads can be made for target DNA sequencing of 150-300 nt per
analysis on Illumina systems (Alvarez-Cubero et al., 2017; McGinn & Glynne
Gut, 2013; Nextseq 550 System, Specification Sheet, 2021).

The third NGS system is owned by Applied Biosystems. In 2007, Applied Biosystems
(USA) released the “SOLiD” second-generation sequencing system based on the
oligonucleotide ligation technique and a binary coding system. In 2010, the
“Ion Torrent” system, a faster and lower-cost sequencer based on semiconductor
technology, was introduced. Unlike other platforms, the Ion Torrent system does not use
fluorescence, chemiluminescence, or enzyme-cascade techniques for signal detection;
it detects the hydrogen ions released as nucleotides are incorporated. In the following
years, Thermo Fisher Scientific, which acquired Applied Biosystems, developed the Ion
Torrent Personal Genome Machine (PGM), Ion Proton, and Ion S5 sequencers in
turn. The chips (well systems) of the latest Ion S5 system can perform 2
million to 130 million reads per analysis for 200 nt target DNA sequencing (Ion S5 and
Ion S5 XL Systems Specification Sheet, 2017; Kircher & Kelso, 2010; Meiklejohn &
Robertson, 2017).
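
How a semiconductor signal becomes a sequence can be sketched with a toy flowgram decoder. It assumes that nucleotides are flowed over the chip one at a time in a fixed cycle and that the hydrogen-ion signal in each flow is roughly proportional to the number of bases incorporated. The flow order "TACG" and the signal values are illustrative assumptions; instruments use their own flow orders and far more elaborate signal processing.

```python
def decode_flowgram(signals, flow_order="TACG"):
    """Turn per-flow signal intensities into a base sequence.

    Each flow offers one nucleotide; the signal is treated as approximately
    proportional to how many copies of that base were incorporated.
    """
    bases = []
    for i, signal in enumerate(signals):
        count = max(0, int(signal + 0.5))  # round to the nearest whole incorporation
        bases.append(flow_order[i % len(flow_order)] * count)
    return "".join(bases)

# Hypothetical normalized signals for six flows: T, A, C, G, T, A.
print(decode_flowgram([1.02, 0.05, 0.0, 2.1, 0.9, 1.1]))
```

The example also shows why homopolymer runs are the hard case for this chemistry: a run of two identical bases arrives as one signal of about twice the height, so the length must be inferred from signal magnitude.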


Third-generation sequencing systems 


In addition to these methods, new technologies called third-generation sequencing have
begun to develop. Real-time reading is possible with third-generation sequencing. It is
faster than second-generation systems and allows the study of long
DNA fragments without the need for PCR amplification. One of the
third-generation sequencing technologies is the “MinION” instrument produced by Oxford
Nanopore Technologies, which was released in 2014. The nanopores are
engineered proteins embedded in an electrically resistive membrane made of synthetic
polymers. A voltage is applied across the membrane in solution, and characteristic changes
occur in the resulting current as DNA passes through a pore. In this way, nucleotides
are identified (Peterson et al., 2019; Wilson et al., 2019).

Apart from the MinION device, Oxford Nanopore Technologies has produced
another device called “GridION”. The GridION system uses a large number of
nanopores that can be measured and analyzed in real time. MinION™ is a disposable
version of GridION (Alvarez-Cubero et al., 2017; McGinn & Glynne Gut, 2013).


Use of NGS in forensic sciences 


The application of DNA technologies in forensic research has made DNA analysis an
important tool in forensic science. Compared to other fields of life sciences, forensic
DNA analysis demands particularly high accuracy and reproducibility because crime scene
samples often contain low-copy-number DNA and are highly degraded and contaminated
(Alvarez-Cubero et al., 2017; Elkins et al., 2021).

The Sanger methodology has been recognized as the gold standard for DNA
sequencing. However, this capillary electrophoresis (CE)-based analysis also has
disadvantages. Examples include the inability to analyze
multiple genetic polymorphisms in a single reaction using a single workflow,
low-resolution genotyping of available markers, and the inability to obtain useful genomic
information from degraded DNA samples (Elkins et al., 2021).

Forensic laboratories carry out identification analyses with various types of genetic
markers (autosomal STR, Y-STR, X-STR, InDel, mtDNA, autosomal
SNP, X-SNP, and Y-SNP, as well as ancestry-informative SNPs and phenotypic
SNPs). Most of these markers can be analyzed with the classical Sanger sequencing
method in 1 day, while some may take a day or two. However, two different types of
analysis cannot be done in a single workflow or in a single reaction. For this reason,
companies have developed commercial NGS kits that allow these markers to be analyzed
together as an alternative to Sanger sequencing systems, and these kits
have started to be used routinely in forensic laboratories (Elkins et al., 2021).


Commercial NGS kits used for forensic analysis 


The NGS panels routinely used in the field of forensic sciences are supplied by two
companies: Thermo Fisher Scientific and the Illumina-Verogen partnership.


Illumina-Verogen partnership 

The system that Illumina developed in cooperation with the forensic science community
by modifying the MiSeq, one of its NGS systems, is known as the
“Illumina MiSeq FGx Forensic Genomics System”. This system is designed entirely
for forensic analysis and provides direct results from the DNA sample. Verogen has also
developed the following five kits for use in forensic science, all compatible with
the Illumina MiSeq FGx Forensic Genomics System (Jager et al., 2017).

• ForenSeq DNA Signature Prep Kit

• ForenSeq Kintelligence Kit

• ForenSeq MainstAY Kit

• ForenSeq mtDNA Control Region Kit

• ForenSeq mtDNA Whole Genome Kit


ForenSeq DNA Signature Prep Kit 

The ForenSeq DNA Signature Prep Kit is an NGS-based kit designed to generate profiles that
can be uploaded to the National DNA Index System for forensic cases. It contains more than
200 markers, combining STRs and SNPs: Amelogenin, 27 autosomal STRs, 7 X chromosomal
STRs, 24 Y chromosomal STRs (Y-STRs), 94 identity-informative SNPs, 22
phenotypic-informative SNPs, and 56 ancestry-informative SNPs. The kit
combines all markers into one streamlined workflow (Guo et al., 2017).


ForenSeq Kintelligence Kit 

Forensic genetic genealogy (FGG) analysis makes important contributions to
cold cases, missing persons investigations, and innocence projects in forensic sciences. FGG
analysis combines microarray genotyping and whole-genome sequencing to create
DNA profiles. The resulting profile is compared against genealogy databases such as
GEDmatch, which is considered the largest database. The comparison determines whether
genetic relatives of the perpetrators of unsolved cases or of missing persons are among those who
have voluntarily registered their DNA information in the database.

The ForenSeq Kintelligence Kit, which sequences 10,230 SNPs, makes it possible to 
compare the target DNA with the DNA of the individuals in the database. The ForenSeq 
Kintelligence Kit does not analyze disease-related DNA regions. It also includes the con- 
tent of the ForenSeq DNA Signature Prep Kit (identity markers, biogeographic ancestry, 
phenotype determination) as it is used for forensic purposes. It includes 56 Ancestry 
SNPs, 94 Identity SNPs, 9867 Kinship SNPs, 22 Phenotype SNPs, 106 X-SNPs, and 
85 Y-SNPs (ForenSeq Kintelligence Kit, Datasheet, 2021). 
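
As a quick consistency check, the marker classes listed for the two kits can be tallied. The sketch below uses only the counts quoted in the text; the dictionary keys and the function name are illustrative.

```python
# Marker counts as quoted in the text for each kit.
SIGNATURE_PREP = {
    "Amelogenin": 1,
    "autosomal STRs": 27,
    "X-STRs": 7,
    "Y-STRs": 24,
    "identity SNPs": 94,
    "phenotype SNPs": 22,
    "ancestry SNPs": 56,
}
KINTELLIGENCE = {
    "ancestry SNPs": 56,
    "identity SNPs": 94,
    "kinship SNPs": 9867,
    "phenotype SNPs": 22,
    "X-SNPs": 106,
    "Y-SNPs": 85,
}

def total_markers(panel):
    """Sum the marker counts of a panel given as {marker class: count}."""
    return sum(panel.values())

print("DNA Signature Prep:", total_markers(SIGNATURE_PREP))
print("Kintelligence:", total_markers(KINTELLIGENCE))
```

The Kintelligence classes add up to the 10,230 SNPs stated above, and the DNA Signature Prep classes add up to 231 markers, consistent with "more than 200".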


ForenSeq MainstAY Kit 

The ForenSeq MainstAY Kit is a panel that enables profiles to be obtained by analyzing
27 autosomal STRs (AuSTRs) and 25 Y-STRs together on MiSeq FGx Sequencing
Systems. It aims to reduce the workload and cost in forensic laboratories by combining two
separate workflows into a single amplification and sequencing run.

STR loci included in the ForenSeq MainstAY Kit:

AuSTRs (plus amelogenin): D1S1656, vWA, TPOX, D12S391, D2S1338, D2S441, Penta E, D3S1358,
D13S317, D16S539, D4S2408, D17S1301, FGA, D18S51, D5S818, D19S433, CSF1PO,
D20S482, D6S1043, D21S11, Penta D, D7S820, D8S1179, D22S1045, D9S1122,
TH01, D10S1248, amelogenin.

Y-STRs: DYF387S1, DYS439, DYS19, DYS448, DYS385a-b,
DYS460, DYS389I, DYS481, DYS389II, DYS505, DYS390, DYS522, DYS391,
DYS533, DYS392, DYS549, DYS393, DYS570, DYS437, DYS576, DYS438,
DYS612, DYS635, Y-GATA-H4, DYS643 (ForenSeq MainstAY Kit, Datasheet, 2021).


ForenSeq mtDNA Control Region Kit

The ForenSeq mtDNA Control Region Kit enables analysis of the
control region of mtDNA from a low amount of DNA and is especially suitable for
degraded or trace biological samples. To improve performance on highly
degraded samples, this kit includes 122 primers generating 18 amplicons of <150 bp in length,
recently designed by evaluating mtDNA variant and frequency data (Data Sheet:
ForenSeq mtDNA Control Region, 2019; Holt et al., 2021).


ForenSeq mtDNA Whole Genome Kit 

The ForenSeq mtDNA Whole Genome Kit is an NGS-based product containing 663 primers
generating 245 amplicons for sequencing the entire mtDNA in challenging samples. The kit
allows for cross-integrated diversity for genome analyses. Even with a low
amount of input DNA (100 pg), the kit may allow an error-free analysis of the 16,569 bp genome
(ForenSeq mtDNA Whole Genome Kit, Datasheet, 2020; Holt et al., 2021).


Thermo Fisher Scientific

The following 10 kits, which are compatible with the company’s Ion Torrent PGM
System and HID Ion GeneStudio S5 System devices, are used in forensic sciences.

• Ion AmpliSeq DNA Phenotyping Panel

• Ion AmpliSeq PhenoTrivium Panel

• Precision ID mtDNA Control Region Panel

• Precision ID mtDNA Whole Genome Panel

• Precision ID Ancestry Panel

• Precision ID Identity Panel

• Precision ID GlobalFiler NGS STR Panel v2

• Ion AmpliSeq HID Y-SNP Research Panel v1

• Ion AmpliSeq MH-74 Plex Research Panel

• Ion AmpliSeq VISAGE Basic Tool Research Panel


Ion AmpliSeq DNA Phenotyping Panel

DNA phenotyping can be performed to aid the investigation in cases where an STR
profile has no match in the DNA database. The Ion AmpliSeq DNA Phenotyping
Panel is used to predict hair and eye color using the 24 phenotype markers of the
HIrisPlex system. This panel includes 23 SNPs and an indel from the following genes:
MC1R, SLC45A2, ASIP/PIGU, EXOC2, HERC2, IRF4, KITLG, OCA2, TYR,
SLC24A4, and TYRP1. It provides results showing the probability of a blue,
intermediate, or brown eye color in combination with black, brown, red, or blond hair
(Walsh et al., 2013).


Ion AmpliSeq PhenoTrivium Panel

In forensic genetics, studies continue to explore alternative polymorphic regions to
achieve success with challenging samples. In particular, biogeographic ancestry and
DNA phenotype information can help investigations of unsolved cases or cases without
suspects. The Ion AmpliSeq PhenoTrivium Panel combines 200 autosomal SNPs and 120
Y-chromosomal SNPs in a single panel covering biogeographic ancestry, DNA
phenotyping, and male lineage markers. With this kit, it is possible to determine
lineage and phenotype from 125 pg of genomic DNA (gDNA) (Diepenbroek et al., 2020).


Precision ID mtDNA control region panel 

This is an NGS method for mtDNA analysis developed for forensic applications. mtDNA
analysis is important when the available nuclear DNA is insufficient, for example
when samples such as hair without follicles, teeth, and bone are taken from crime
scenes, missing person cases, or DVI. It is a multiplex method of 2 pools targeting
the 1.2 kb (kilobase) control region of the mitochondrial genome, which includes
HV-I, HV-II, and HV-III. It was designed to obtain optimal results from samples that
are likely to be degraded, such as hair, teeth, and bones. This panel is based on
Ion AmpliSeq technology (Precision ID mtDNA Panel Analysis User Guide, 2016).


Precision ID mtDNA whole genome panel 

This is a 2-pool multiplex method targeting the entire mitochondrial genome, i.e.,
16,569 bp. It contains 81 primer pairs with minimal primer overlap between pools. It
was designed to obtain optimal results from samples that are likely to be degraded,
such as hair, teeth, and bones (Precision ID mtDNA Panel Analysis User Guide, 2016).


Precision ID ancestry panel 

This panel contains 165 SNPs that provide biogeographic and genealogical information.
Of the 165 SNP sites, 55 were derived from publications by Kenneth Kidd, and 123 were
obtained from publications by Michael Seldin. The amplicons are generally shorter
than 130 bp, so degraded samples can be studied with this panel
(Kidd, 2015, pp. 1-80; Pakstis et al., 2010).


Precision ID identity panel 
This panel is one of the first NGS systems developed for human identification and can
uniquely identify samples, whether degraded or in good condition. It has high
discrimination power, using 34 upper Y-clade SNPs and 90 autosomal SNPs with high
heterozygosity, for a total of 124 markers. It is used for degraded or challenging
forensic samples.
This kit’s discrimination of individuals is similar to STR genotype match probabilities
used in forensic sciences (between 1 x 10°’ and 6 x 10°”) (Karafet et al., 2008; 
Kidd, 2015, pp. 1-80; Pakstis et al., 2010; Phillips et al., 2007).


Precision ID GlobalFiler NGS STR Panel v2
This panel includes a total of 35 markers (20 CODIS core STR regions, 11 non-CODIS
STRs, and 4 sex markers) to aid in the interpretation of challenging case samples.
20 CODIS STR regions: TPOX, D3S1358, FGA, D5S818, CSF1PO, D7S820,
D8S1179, TH01, vWA, D13S317, D16S539, D18S51, D21S11, D1S1656, D2S1338,
D2S441, D10S1248, D12S391, D19S433, and D22S1045;
11 non-CODIS STR regions: D1S1677, D2S1776, D3S4529, D4S2408, D5S2800,
D6S1043, D6S474, D12ATA63, D14S1434, Penta E, and Penta D;
4 sex markers: Amelogenin, DYS391, SRY, and Y-indel (Ragazzo et al., 2020; Tao et al., 2019).


Ion AmpliSeq HID Y-SNP research panel v1

It is an NGS-based kit developed for inferring paternal lineage and paternal
biogeographic ancestry using Y-chromosomal SNPs and Y-chromosomal haplogroups. This
panel allows the analysis of 859 phylogenetic Y-SNPs, which define 640 Y haplogroups
(Arwin et al., 2019; Claerhout et al., 2021).


Ion AmpliSeq MH-74 plex research panel

It is a 157-325 bp panel consisting of a total of 230 SNPs, containing 74 more
microhaplotypes in addition to 130 microhaplotypes previously optimized by the Kidd lab.
Microhaplotypes are particularly suitable for the analysis of mixed forensic samples
and also contain SNPs that inform biogeographical ancestry. The panel offers
alternative polymorphic regions for cases where conventional STR typing by capillary
electrophoresis fails. It is possible to obtain a full profile with 50 pg of DNA
(Oldoni et al., 2020).


Ion AmpliSeq VISAGE-basic tool research panel
In forensic science, DNA phenotyping is done to aid the investigation when there is no 
DNA database or no match in the database. The investigation can be assisted by finding 
data on a person’s phenotype, including appearance characteristics and biogeographic 
ancestry. 

The Ion AmpliSeq VISible Attributes through GEnomics (VISAGE) Basic Tool Research
Panel includes 41 SNPs of the HIrisPlex-S assay for phenotype and biogeographic
ancestry, with an additional 115 biogeographic ancestry markers capable of detecting
population differentiation from 7 continents, for a total of 153 markers. A full
profile can be obtained with 100 pg of gDNA with this kit (Arwin et al., 2019;
Xavier et al., 2020).


Conclusion 


NGS technologies, also known as massively parallel sequencing (MPS), have been used in 
genetic research since the 2000s. Various methods have been developed that are faster, 
more efficient, and less costly for NGS. As NGS methods and platforms have developed, 
the quality of sequences has increased, and in recent years, forensic genetics laboratories 
have turned to this technology. NGS technology allows many marker systems (STRs,
SNPs, and mRNA) to be analyzed at the same time. In addition, this technology supports
DNA database creation, ancestry and phenotype determination, differentiation of
identical twins, identification of body fluids and species, and microbiological
analysis.

NGS has many advantages over the traditional CE method for forensic genetic analysis,
although its introduction into laboratories has been difficult and its application has
met some obstacles. It has also not completely replaced conventional CE technology and
kits. The obstacles include a lack of validated and powerful kits, high equipment
costs, limited NGS training of analysts, and the need for validation and optimization
studies.

However, in recent years, both researchers and commercial companies have acceler- 
ated their studies on this subject, allowing NGS to become an established system in the 
field of forensic sciences. The development of commercial panels not only enabled the 
use of an alternative method to the CE system in forensic genetic analysis but also 
gave direction to investigations with forensic analyses at a point where CE technology 
was insufficient. Owing to their advantages, such as the ability to combine more than
one forensic genetic technique in a single analysis, these systems are expected to be
used in a large share of forensic laboratories in the near future.


Introduction


DNA sequencing for forensic purposes is currently almost exclusively performed by 
second-generation sequencing methods (Bruijns et al., 2018). Verogen and Thermo 
Fisher Scientific are among the most important providers, both relying on sequencing- 
by-synthesis (SBS). Verogen commercializes the MiSeq FGx Sequencing System, based 
on the Illumina platform. Thermo Fisher Scientific distributes the Ion Torrent systems 
(Churchill et al., 2016; Zhang et al., 2017). Both technologies yield high-accuracy reads 
at sufficient throughput. However, their widespread introduction into routine forensic
use is hampered by several limitations. First, SBS-based systems are short-read
sequencing systems with a maximal read length of about 500 nucleotides, making assay
design and read alignment of long short tandem repeat (STR) loci more challenging.
Second, the acquisition of such a technology requires a large capital investment.
Third, the cost per sample is markedly higher than for capillary electrophoresis (CE).
Fourth, highly trained staff are required for sample and library preparation. Lastly,
analysis is performed in centralized laboratories.

The introduction of nanopore sequencing, a third-generation sequencing system, in 
the field of forensic DNA genotyping avoids the need for a large capital investment. Ox- 
ford Nanopore Technologies (ONT) commercializes a handheld sequencing device, the 
MinION, at a negligible cost (Jain et al., 2016). This long-read sequencing yields reads 
spanning more than 500 nucleotides. Moreover, the device’s small footprint allows for 
sequencing in remote locations, e.g., at a crime scene. Devices capable of yielding higher 
throughputs than the MinION device, namely the GridION and the PromethION, are 
also available on the market. These devices are more suited for centralized laboratories 
performing routine forensic DNA analysis. 

Although the general concept of nanopore sequencing dates back to an idea sketched 
in 1989 by David Deamer in his notebook, nanopore sequencing became available only 
in 2014 (Deamer et al., 2016). Nevertheless, this technology is rapidly being explored in 
the forensic field. In this chapter, we will discuss the nanopore sequencing technology in 
detail, followed by an overview of possible applications in forensic investigations. Lastly,
we will elaborate on the challenges that should be overcome for nanopore sequencing to
become a routine application in forensics.


Nanopore sequencing 


In contrast to most other systems available on the market, nanopore sequencing does not 
rely on sequencing by synthesis. Instead, individual DNA molecules are sequenced in 
real-time without the intervention of a polymerase enzyme. The heart of this technology 
consists of an array of nanopores, which are small apertures embedded in an electrically 
resistant membrane. As shown in Fig. 6.1, this membrane separates two reservoirs filled 
with ionic fluid from each other. Upon applying a voltage over the membrane, a current 
of ions will flow through these nanopores. Nanopore sequencing is performed by threading
a DNA molecule through a nanopore. A motor protein is attached to the nanopore to
regulate the speed at which a DNA strand is translocated through the pore and to unwind 
double-stranded DNA. The presence of a DNA molecule alters the ionic current by 
partially blocking the aperture. As each nucleotide modifies the current in a slightly 
different way, the obtained current profile, called a ‘squiggle’, can be translated to a 
nucleotide sequence (Kasianowicz et al., 1996). This process is called base calling and 
is performed by machine learning algorithms. Advanced computational methodologies
are necessary because, at each moment in time, a sequence of 5-7 nucleotides is present
in a pore, and thus a combination of nucleotides dominates the current signal. This makes
the deconvolution of a squiggle into a nucleotide sequence extremely challenging.

[Figure 6.1, right panel axes: ionic current versus time.]

Figure 6.1 General principle of nanopore sequencing. Left: a motor protein, attached to a nanopore
protein, unwinds a dsDNA strand. One of the two strands is translocated through the nanopore, which is
embedded in a voltage-biased membrane. The presence of a DNA strand alters the ion flow through
the nanopore. Right: The resulting ionic current can be used to decode the DNA sequence, as the
disruption of the ionic current is characteristic for each nucleotide passing through the pore.
(Created with BioRender.com.)

ONT uses engineered proteins as nanopores. In general, two types of nanopores are 
currently used for sequencing, i.e., the R9 and the R10 pores. For each type of pore, 
multiple iterations have been commercialized. As nanopore sequencing can still be 
considered a relatively recent technology, ONT is continuously working on improving 
its technology. At the time of writing, the state-of-the-art pore types were R9.4 and 
R10.4. The R10-series of nanopores is more recent than the R9-series and is character- 
ized by a dual reader head, as shown in Fig. 6.2 (Van der Verren et al., 2020). Sequencing 
using the R9 nanopores, which only contain a single reader head, proved to be trouble- 
some for DNA strands containing homopolymers (Eskola et al., 2017). Threading a ho- 
mopolymer through the nanopore gives rise to a constant current signal over a certain 
time period, challenging the base calling process. The dual reader head of the R10- 
series was designed to improve the base calling accuracy of homopolymers as more nu- 
cleotides dominate the signal. 
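
The homopolymer problem can be illustrated with a toy signal model. The sketch below is only an illustration under stated assumptions and is not ONT's pore model: the window size K = 5 is taken from the 5-7 nucleotide range mentioned above, and the current levels assigned to each k-mer are invented (derived from a hash). It shows that a homopolymer longer than the sensing window yields a run of identical current levels, so the number of bases has to be inferred from the duration of the signal alone.

```python
import hashlib

K = 5  # assumed number of nucleotides sensed at once (the text gives 5-7)


def kmer_level(kmer: str) -> float:
    """Map a k-mer to an arbitrary but deterministic pseudo-current in pA.

    The values are invented for illustration; real pore models are
    measured empirically.
    """
    digest = hashlib.sha256(kmer.encode()).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    return 60.0 + 60.0 * fraction  # somewhere between 60 and 120 pA


def ideal_squiggle(seq: str, k: int = K) -> list[float]:
    """Return one noise-free current level per k-mer window."""
    return [kmer_level(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def longest_flat_run(levels: list[float]) -> int:
    """Length of the longest run of identical consecutive levels."""
    if not levels:
        return 0
    best = run = 1
    for prev, cur in zip(levels, levels[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best


mixed = "ACGTTGCAGTCA"
homopolymer = "ACGT" + "A" * 10 + "CGTA"
print("mixed sequence:", longest_flat_run(ideal_squiggle(mixed)))
print("with 10-A homopolymer:", longest_flat_run(ideal_squiggle(homopolymer)))
```

In this toy model the ten-base homopolymer produces six consecutive identical windows, whereas every window of the mixed sequence has its own level. A second reader head, as in the R10 series, effectively widens or duplicates the sensed context, which is why it helps in such regions.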

Besides different types of nanopores, ONT also offers a broad range of sequencing 
instruments, each characterized by its throughput. The smallest device is the 
MinION, which weighs less than 0.5 kg. It accommodates the MinION flow cells, 
containing an array of 2048 nanopores and the electronics needed for sequencing 
(Jain et al., 2016). Five hundred twelve of these nanopores can be sequenced in parallel. 
More recently, the Flongle adapter and Flongle flow cells were introduced. Flongle 
flow cells only contain 126 channels and can be used on a MinION device by using 
the Flongle adapter. The electronics needed for sequencing are embedded in this
reusable adapter, lowering the cost significantly. MinION and Flongle flow cells are also
compatible with the GridION device, a benchtop device capable of running 5 flow
cells in parallel. For projects that require high throughput, a PromethION device and
compatible flow cells are best suited. One PromethION can run 24 (PromethION 24)
or 48 (PromethION 48) flow cells in parallel, each capable of using 2675 nanopores in
parallel.

Figure 6.2 Comparison of R9 and R10 nanopores. An R9 nanopore is characterized by a single reader
head, which is, in essence, a constriction. As the R10 nanopores have a dual reader head, more
nucleotides dominate the signal. (Created with BioRender.com.)
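
To put the device figures side by side, the short calculation below multiplies the number of channels per flow cell by the number of flow cells each instrument can run, using only the numbers quoted in this section. The dictionary names are illustrative, and the results are upper bounds on channels that can be read simultaneously, not measured throughputs.

```python
# Channels that can be read in parallel per flow cell, as quoted in the text.
FLOW_CELL_CHANNELS = {"Flongle": 126, "MinION": 512, "PromethION": 2675}

# (flow cell type, number of flow cells run in parallel), as quoted in the text.
DEVICES = {
    "MinION + Flongle adapter": ("Flongle", 1),
    "MinION": ("MinION", 1),
    "GridION": ("MinION", 5),
    "PromethION 24": ("PromethION", 24),
    "PromethION 48": ("PromethION", 48),
}


def parallel_channels(device: str) -> int:
    """Upper bound on simultaneously sequencing nanopores for a device."""
    flow_cell, positions = DEVICES[device]
    return FLOW_CELL_CHANNELS[flow_cell] * positions


for name in DEVICES:
    print(f"{name}: {parallel_channels(name)} channels")
```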


Forensic applications of nanopore sequencing 
Short tandem repeat genotyping 


STRs are the most commonly used marker in forensic genetics (Butler, 2005). The
advantages of sequencing STRs instead of performing length-based analysis using CE are
well known (Bruijns et al., 2018; Borsting & Morling, 2015). MPS yields a more
informative output because, besides length polymorphisms, sequence polymorphisms can
also be detected, e.g., single nucleotide polymorphisms (SNPs) in the repeat region and
the surrounding regions. As a consequence, multiple research groups are investigating
the performance of STR genotyping by nanopore sequencing. Unfortunately, due to their
highly repetitive nature, STRs are challenging targets for sequencing technologies.
Moreover, the read quality of nanopore reads is still lower than that of, e.g., Illumina
sequencing (Rang et al., 2018). The first report on nanopore sequencing of forensic
STRs demonstrated those issues (Cornelis et al., 2018). After
performing a 14-locus multiplex PCR, the resulting amplicons were randomly concat- 
enated. This additional step was required as the minimal read length at the time was 100 
nucleotides, which exceeded the length of some amplicons. Sequencing was performed 
using an R7 flow cell, which is no longer available. Only partial forensic profiles could be 
obtained due to a high sequencing error rate. Moreover, a particularly high error rate was 
observed in homopolymeric regions, mostly partial deletions. These findings were 
corroborated by Asogawa et al. (Asogawa et al., 2020). 

The frequently targeted STR loci were originally selected for analysis using CE. 
Hence, not all STR loci are equally suited for genotyping by sequencing. In order to 
assess why genotyping of some specific STR loci after nanopore sequencing is inaccurate, 
Tytgat et al. identified locus-dependent characteristics that compromise accurate geno- 
typing (Tytgat et al., 2020). These findings can assist in constructing an STR panel suit- 
able for nanopore sequencing. Firstly, the presence of homopolymers in the repeat region 
proved cumbersome. In addition, high repeat numbers, the presence of partial repeats, and
complex repeat patterns proved problematic for nanopore sequencing. Lastly,
sequence similarity between the repeat region and the flanking region hampers correct 
alignment to reference alleles. Nevertheless, genotyping a subset of STR loci after nano- 
pore sequencing could reliably be performed for single-contributor samples. In the same 
report, Tytgat and colleagues compared the genotyping accuracy after sequencing using 
the R9.4 versus the R10 version of nanopores. Although sequencing using the R10 flow 
cell improved the genotyping accuracy for loci characterized by the presence of
homopolymers, the overall genotyping accuracy was lower due to an increased level of
random sequencing errors.

More recent publications focus on using a commercially available NGS
kit to perform amplification prior to nanopore sequencing. Ren et al. and Tytgat et al.
used the ForenSeq DNA Signature Prep Kit (Ren et al., 2021; Tytgat et al., 2021), 
commercialized by Verogen. Hall and colleagues used the Promega PowerSeq 46GY 
System (Hall et al., 2021). Owing to the continuous improvement of nanopores, kits, 
and base-calling software, most STR loci can now accurately be genotyped after nano- 
pore sequencing. Moreover, the added value of sequencing STR loci instead of length- 
based analysis is now harnessed. A successful demonstration of iso-allele genotyping was 
obtained after amplification using the ForenSeq DNA Signature Prep Kit (Tytgat et al., 
2021). Hall and colleagues (2021) included the detection of SNPs within the flanking 
region in their data analysis pipeline, STRspy. More than 90% of the SNPs occurring 
in these flanking regions could be genotyped correctly in their study (Hall et al., 2021). 

An additional important aspect of improving genotyping accuracy is the workflow 
used for data analysis. Most reports describe the construction of a reference database con- 
taining all possible alleles of all loci targeted by the assay. After base-calling, reads are 
assigned to their corresponding loci, either by primer recognition or mapping against 
the human reference genome. In the next step, the obtained reads are aligned against 
the database of reference alleles. Each read is assigned to the best-matching allele. 
Optionally, an alignment score filter can be employed to discard reads affected by ampli- 
fication, sequencing, and alignment errors. The resulting read counts are then used for 
genotyping. Developing a robust, validated data-analysis workflow is crucial, as exempli- 
fied by Tytgat et al. (2021). Two workflows were compared on an identical dataset, 
resulting in pronounced differences in genotyping accuracy. A drawback to these work- 
flows is their dependence on a reference database. All possible (iso-)alleles should be pre- 
sent in this database. The presence of an allele in the sample that is not recorded in the 
database will lead to incorrect genotyping. Expanding the existing databases containing 
STR sequences, such as STRSeq (Gettings et al., 2017), with sequence data gathered by 
the community will be crucial. Although a considerable improvement over the course of 
a few years can be noted, further advancements should be realized for this technology to 
be implemented in routine use. Moreover, large developmental validation studies are 
needed to establish locus-dependent data-analysis parameters, e.g., allelic imbalance 
cut-offs. 
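
The reference-database workflow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline of any cited study: reads are assumed to be already assigned to one locus, plain edit distance stands in for an alignment score, and the locus, flanking sequences, and thresholds (`max_dist`, `min_reads`, `min_balance`) are all hypothetical.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, used here as a stand-in for an alignment score."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def best_allele(read, alleles, max_dist):
    """Name of the closest reference allele, or None if too distant or tied."""
    scored = sorted((edit_distance(read, seq), name) for name, seq in alleles.items())
    if scored[0][0] > max_dist:
        return None  # alignment score filter: discard poorly matching reads
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return None  # ambiguous read
    return scored[0][1]


def genotype(reads, alleles, max_dist=2, min_reads=5, min_balance=0.5):
    """Call a genotype from per-allele read counts (thresholds are hypothetical)."""
    counts = Counter()
    for read in reads:
        name = best_allele(read, alleles, max_dist)
        if name is not None:
            counts[name] += 1
    ranked = counts.most_common(2)
    if not ranked or ranked[0][1] < min_reads:
        return None  # not enough supporting reads
    if len(ranked) == 2 and ranked[1][1] >= min_balance * ranked[0][1]:
        return (ranked[0][0], ranked[1][0])  # heterozygous
    return (ranked[0][0], ranked[0][0])  # homozygous


# Invented locus: flank + (AGAT)n + flank, with reference alleles 8, 9, and 10.
FLANK5, FLANK3, UNIT = "GATC", "CCTG", "AGAT"
alleles = {str(n): FLANK5 + UNIT * n + FLANK3 for n in (8, 9, 10)}
reads = (
    [alleles["8"]] * 6                      # error-free reads of allele 8
    + ["GTTC" + UNIT * 10 + FLANK3] * 5     # allele 10 with one substitution
    + ["ACGTACGTACGT"]                      # unrelated read, filtered out
)
print(genotype(reads, alleles))
```

The sketch also makes the database dependence visible: a read from an allele that is absent from `alleles` is either discarded by the score filter or assigned to the nearest known allele, which is how an unrecorded allele leads to an incorrect genotype.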


Single nucleotide polymorphism genotyping 

SNPs are an interesting marker for forensic research, although they are less commonly 
investigated than STRs. SNPs are particularly useful for the analysis of highly degraded 
DNA, as they can be analyzed from short amplicons (Borsting et al., 2013). Moreover, 
they are less prone to mutation, and they can reveal some phenotypic traits of the sample
donor (Butler et al., 2007). Unfortunately, due to their biallelic nature, SNPs are
characterized by a lower discriminatory power compared to STRs, therefore necessitating
higher-order multiplexing (Butler et al., 2007). The SNPforID consortium developed 
a multiplex targeting 52 SNP markers, thereby approaching the discriminatory power 
of the routinely used STR panels consisting of 10-15 loci (Sanchez et al., 2006). Cornelis 
and colleagues used this panel to assess the accuracy of SNP genotyping based on nano- 
pore reads (Cornelis et al., 2017). After sequencing using an R9.4 flow cell, all but one of 
the loci were genotyped correctly. However, for some loci, achieving robust genotyping 
after nanopore sequencing was problematic. All of these loci contained a homopolymer 
in the region flanking the targeted SNP, again indicating how troublesome homopoly- 
mers are for nanopore sequencing. 
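
Why a biallelic marker needs higher-order multiplexing can be made concrete with a small calculation. The sketch below computes the probability that two unrelated individuals share a genotype at one locus, assuming Hardy-Weinberg proportions and independent loci. The allele frequencies are idealized assumptions (a SNP with two equally frequent alleles and a hypothetical STR with ten equally frequent alleles), not data from any cited panel.

```python
import math
from itertools import combinations_with_replacement


def genotype_match_probability(allele_freqs):
    """Probability that two unrelated people share a genotype at one locus.

    Assumes Hardy-Weinberg proportions: p^2 for homozygotes, 2pq for
    heterozygotes. The match probability is the sum of squared genotype
    frequencies.
    """
    total = 0.0
    indices = range(len(allele_freqs))
    for i, j in combinations_with_replacement(indices, 2):
        if i == j:
            freq = allele_freqs[i] ** 2
        else:
            freq = 2 * allele_freqs[i] * allele_freqs[j]
        total += freq * freq
    return total


def panel_match_probability(loci):
    """Combined match probability, assuming the loci are independent."""
    return math.prod(genotype_match_probability(freqs) for freqs in loci)


snp = [0.5, 0.5]        # idealized biallelic SNP
str_like = [0.1] * 10   # hypothetical STR with 10 equally frequent alleles

print("per SNP:", genotype_match_probability(snp))
print("per STR:", genotype_match_probability(str_like))
print("52 SNPs:", panel_match_probability([snp] * 52))
print("10 STRs:", panel_match_probability([str_like] * 10))
print("15 STRs:", panel_match_probability([str_like] * 15))
```

Under these idealized assumptions, the per-locus match probability is 0.375 for the SNP and 0.019 for the STR, and a 52-SNP panel falls between 10 and 15 such STR loci. Real allele frequencies are less even, so actual values differ.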

In a follow-up study, Cornelis and colleagues investigated a panel targeting 16 trial- 
lelic SNPs (Cornelis et al., 2019). Nonbinary SNP panels are characterized by a higher 
discriminatory power compared to biallelic SNP panels. Moreover, analyzing multiallelic 
SNPs enables the reliable detection of mixture samples (Phillips et al., 2004, pp. 27-29). 
Four out of five investigated samples could be genotyped correctly. Again, some prob- 
lematic loci were identified, which should preferably be avoided during the design of 
panels dedicated to nanopore sequencing. This latest study by Cornelis and colleagues 
dates back to 2018. In the following years, the read accuracy of nanopore sequencing 
has improved. Two studies dating from 2021, as mentioned above, assessed the genotyp- 
ing accuracy of nanopore sequencing after amplification using the ForenSeq DNA Signa- 
ture Prep Kit (Ren et al., 2021; Tytgat et al., 2021). Besides autosomal and nonautosomal 
STRs, this kit targets 94 SNP loci, thereby combining the advantages of both types of 
markers. Both studies reported an SNP genotyping accuracy of >99%, indicating
that nanopore sequencing is ideally suited to perform forensic SNP genotyping. Incorrect
genotyping was mainly related to low coverage, which might be explained by the
extremely high degree of multiplexing. Despite its higher error rate than Illumina
sequencing, nanopore sequencing can accurately genotype forensic SNPs when
sufficiently high coverage is obtained.
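
The role of coverage can be illustrated with a simple read-count genotype caller. This is a sketch only: the minimum coverage and the heterozygote fraction are hypothetical thresholds, not values taken from the cited studies, and real pipelines derive such cut-offs from validation data.

```python
def call_snp(counts, min_coverage=20, het_min_fraction=0.2):
    """Call a SNP genotype from base counts at one position.

    counts: dict mapping a base to its read count, e.g. {"A": 48, "G": 2}.
    Returns a two-letter genotype, or None when coverage is too low to call.
    Both thresholds are illustrative assumptions.
    """
    total = sum(counts.values())
    if total < min_coverage:
        return None  # no call: too few reads to separate alleles from errors
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    first, _ = ranked[0]
    second, n_second = ranked[1] if len(ranked) > 1 else (None, 0)
    if n_second / total >= het_min_fraction:
        return "".join(sorted(first + second))  # heterozygous
    return first + first  # homozygous; minor reads treated as errors


print(call_snp({"A": 48, "G": 2}))   # minor allele at 4%: treated as error
print(call_snp({"A": 27, "G": 23}))  # balanced counts: heterozygous
print(call_snp({"A": 6, "G": 5}))    # only 11 reads: no call
```

With a per-read error rate of around 1%, a handful of erroneous reads cannot reach the heterozygote fraction once coverage is high, which is why low-coverage loci account for most miscalls.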


Mitochondrial DNA genotyping 


As discussed above, human DNA profiling is mostly performed by analyzing STRs and 
SNPs present in nuclear DNA. However, in cases where nuclear DNA is highly 
degraded, analysis of mitochondrial DNA (mtDNA) might provide some valuable infor- 
mation. The mitochondrial genome is about 16.5 kb long and is characterized by a very 
high copy number per cell (Anderson et al., 1981). As mtDNA is maternally inherited, all 
maternal relatives share the same mtDNA sequence, apart from mutations. It should be 
noted that the mtDNA polymerase has lower fidelity, resulting in a higher mutation rate 
of mtDNA compared to nuclear DNA. The maternal inheritance makes mtDNA
genotyping extremely useful for mass disaster victim identification, as maternal relatives
can donate a sample to serve as a reference, in case no reference DNA of the victim is
available (Budowle et al., 2003). In most cases, the hypervariable regions I and II
(HV1 and HV2) are assessed (Wilson et al., 1995). As nanopore sequencing is a long- 
read sequencing system, the entire mtDNA genome can be sequenced more easily, 
thereby reducing the error in haplogroup assignment (King et al., 2014; Parsons & Coble, 
2001). This increased discriminatory power is particularly of interest for discerning a 
mixture sample from a heteroplasmic sample. Heteroplasmy is the condition in which 
a cell harbors a mixture of nonidentical mtDNA copies and is thus difficult to distinguish 
from a mixture sample (Stefano et al., 2017). 

Only limited research has been conducted on mtDNA nanopore sequencing for 
forensic purposes. Lindberg and colleagues successfully phased variants of a 1:1 mixture 
sample by combining Illumina short-read sequencing and nanopore long-read 
sequencing (Lindberg et al., 2016). A few years later, when the nanopore sequencing 
technology was more mature, Zascavage and colleagues described nanopore sequencing 
of mtDNA without any preenrichment step, such as PCR (Zascavage et al., 2019). 
Omitting the PCR amplification step eliminates any sequence errors introduced during 
PCR and amplification bias. An overall error rate of approximately 1.00% was reported 
for the complete mitochondrial genome. A considerable portion of these errors could be 
attributed to homopolymeric stretches. From this, it is clear that nanopore sequencing is a 
promising tool for forensic mtDNA analysis. 


Nonhuman forensics 
Species identification 
Species identification by DNA analysis is commonly used in a forensic setting, e.g., to aid 
in reconstructing a crime in which animals are involved or in cases of trafficking in pro- 
tected species. To realize species identification, ‘DNA barcoding’ is mainly used. This 
methodology consists of a PCR step, during which standardized genetic regions are 
amplified, followed by sequencing of the resulting amplicons (Hebert et al., 2003). 
The obtained reads are then compared to a reference database containing the sequences 
of the standardized genetic regions of several species. This way, the sequences serve as a 
unique barcode for a specific species. Currently, sequencing is performed mainly by 
Sanger sequencing or NGS technologies. Nanopore sequencing offers some benefits 
to this field. First, long-read sequencing might increase the robustness of species identi- 
fication workflows, which is needed to discriminate closely related species (Vasiljevic 
et al., 2021). Second, the portability of the MinION device enables species identification 
at remote locations, which can be useful in many cases. Lastly, the low cost of the plat- 
form enables laboratory-based sequencing in regions where access to costly NGS devices 
is limited. 
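
The comparison step of DNA barcoding can be sketched as a best-match search against a reference table. The example below is a simplified illustration: the species names and barcode sequences are invented, `difflib.SequenceMatcher` from the Python standard library stands in for a proper sequence aligner, and the 97% identity threshold is a hypothetical choice.

```python
from difflib import SequenceMatcher

# Invented reference barcodes; a real database holds standardized regions
# for many species.
REFERENCES = {
    "species_A": "ACGTTGCAAGGCTTAACCGGATATCGCGTAGCTAGGATCC",
    "species_B": "TTGACCATGGCAATCGTTAGCCGATAGGCTTACGATCGAA",
}


def identify_species(query, references, min_identity=0.97):
    """Return (best matching species or None, similarity score).

    The score is SequenceMatcher's ratio, used here as a rough proxy for
    percent identity from an alignment.
    """
    best_name, best_score = None, 0.0
    for name, ref in references.items():
        score = SequenceMatcher(None, query, ref, autojunk=False).ratio()
        if score > best_score:
            best_name, best_score = name, score
    if best_score < min_identity:
        return None, best_score  # no sufficiently similar reference
    return best_name, best_score


print(identify_species(REFERENCES["species_A"], REFERENCES))
print(identify_species("T" * 40, REFERENCES))
```

Because the decision depends on a similarity threshold, closely related species with near-identical barcodes are hard to separate with short regions, which is where longer reads can add robustness.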

As the error rate of nanopore reads is higher compared to most NGS methods, a
well-performing data-analysis workflow is crucial to obtain highly accurate consensus
sequences, which can be compared to reference databases. Sahlin et al. developed a
bioinformatics pipeline dedicated to species identification by third-generation sequencing
strategies such as ONT and PacBio, called NGSpeciesID (Sahlin et al., 2021). Reads 
are first filtered based on quality and size, followed by a clustering step of similar reads. 
A consensus read is then constructed, followed by the merging of reverse-complement 
consensus reads. Finally, a polishing step is performed to obtain the final consensus 
sequence. Developmental validation of the NGSpeciesID pipeline after performing 
nanopore sequencing for forensic genetic species identification was performed by Vasil- 
jevic and colleagues (Vasiljevic et al., 2021). The authors concluded that the construction 
of an accurate and reliable single consensus sequence is feasible using the MinION 
sequencer. Moreover, they consider the species identification results obtained using their 
workflow robust enough for forensic purposes. 
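
The idea behind consensus building can be shown with a deliberately simplified sketch. It is not the NGSpeciesID implementation: clustering and polishing are omitted, reads are assumed to contain substitution errors only (so that reads of the most common length can be compared column by column), and the quality threshold is hypothetical. It keeps three of the steps described above: filtering by quality and size, bringing reverse-complement reads into one orientation, and majority voting.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def mismatches(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def consensus(reads, min_quality=10.0):
    """Majority-vote consensus from (sequence, mean_quality) pairs.

    Simplifying assumptions: substitution errors only, and a single
    amplicon, so no clustering or polishing step is needed.
    """
    passed = [seq for seq, quality in reads if quality >= min_quality]
    if not passed:
        return None
    # Size filter: keep reads of the most common length.
    target_len = Counter(len(s) for s in passed).most_common(1)[0][0]
    kept = [s for s in passed if len(s) == target_len]
    # Orient every read like the first kept read (the seed).
    seed = kept[0]
    oriented = []
    for s in kept:
        rc = reverse_complement(s)
        oriented.append(s if mismatches(s, seed) <= mismatches(rc, seed) else rc)
    # Column-wise majority vote.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*oriented))


true_seq = "ACGTTGCAAGGCTTAACCGG"
reads = [
    (true_seq, 15.0),
    ("ACGATGCAAGGCTTAACCGG", 14.0),         # one substitution
    (reverse_complement(true_seq), 13.0),   # opposite strand
    ("ACGTTGCAAGCCTTAACCGG", 12.0),         # one substitution
    ("TTTTTTTTTTTTTTTTTTTT", 5.0),          # low quality, filtered out
    ("ACGT", 20.0),                         # wrong size, filtered out
]
print(consensus(reads) == true_seq)
```

Random errors at different positions are outvoted by the other reads, whereas systematic errors (for example in homopolymers) are not, which is why real pipelines add a polishing step.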


Microbial analysis 

Microbial forensics has been defined by Budowle et al. as “the scientific discipline dedi- 
cated to analyzing evidence from a bioterrorism act, biocrime, or inadvertent microor- 
ganism/toxin release for attribution purposes” (Budowle et al., 2003). Probably the 
most well-known example is the so-called anthrax letter attack. In 2001, letters
containing spores of Bacillus anthracis (B. anthracis) were posted to, among others, two
US senators. The attack resulted in 22 infections, 5 of which were fatal. Therefore,
B. anthracis is a frequently targeted organism within the field of microbial forensics.
Other high-priority
pathogens are Clostridium botulinum, which causes botulism, and Yersinia pestis (Y. pestis), 
which causes plague. All these pathogens have been analyzed by nanopore sequencing 
(Gonzalez-Escalona & Sharma, 2020). Gargis et al. performed whole-genome 
sequencing for B. anthracis and Y. pestis and could even detect different antimicrobial 
resistance genes (Gargis et al., 2019). One of the advantages of nanopore sequencing 
in this field is that real-time information can be obtained during sequencing. McLaughlin
and colleagues performed nanopore whole-genome sequencing of B. anthracis a couple of
hours after receipt (McLaughlin et al., 2020). This clearly illustrates the potential value of
nanopore sequencing in emergency situations where fast answers are needed. Although 
some of the abovementioned studies combine nanopore sequencing with a short-read 
sequencing platform to accurately analyze the genome, nanopore sequencing is capable 
of detecting the targeted pathogens. 

Besides its applications in bioterrorism, microbial forensics is also applied to analyze 
the human microbiome in a forensic context. Ongoing research tries to retrieve a variety 
of forensic information from the microbiome of samples. The most used sample type for 
this application is touch samples, which are compared to the skin microbiome of an in- 
dividual of interest (Schmedes et al., 2017). Second, postmortem interval estimation 
based on the microbiome has been explored, as repeatable shifts in microbial community 
composition occur during body decomposition (Metcalf, 2019). Next, fingerprinting the
microbiome detected in the soil might provide insights into geographical provenance
(Giampaoli et al., 2014). In some instances, analysis of the microbiome can reveal the
cause of death. An example is the detection of bacterioplankton as an indicator of 
drowning (Rutty et al., 2015). Lastly, the tissue or body fluid identity might be retrieved 
from the microbiome (Hanssen et al., 2017). 

Several interesting potential forensic applications exist for microbiome sequencing. 
This field is challenging, because of the huge inter- and intraindividual variability (Lopez 
et al., 2021). Nanopore sequencing could offer some technical advantages in this field, as
long reads allow accurate identification of closely related species and can cover
informative genes (e.g., 16S rRNA) in their entirety, improving the resolution of
identification (Lopez et al., 2021). Unfortunately, to our knowledge, no research on the analysis of
the human microbiome for forensic purposes has been performed yet by nanopore 
sequencing. Nevertheless, nanopore sequencing has proven to allow the analysis of com- 
plex microbial samples, as reviewed by Ciuffreda and colleagues (2021). Therefore, 
nanopore sequencing might play a role in the further development and validation of 
the abovementioned assays. 


Opportunities 
Messenger RNA analysis 


The performance and possible advantages of nanopore sequencing over current tech- 
niques have not been researched in several niche forensic applications. A first example 
is messenger RNA analysis, which can be employed for body fluid identification. The 
composition of a biological sample found at a crime scene can reveal crucial information 
for an ongoing investigation. mRNA-based assays have been developed for the detection
of semen, (menstrual) blood, cervicovaginal fluid, saliva, and skin, as described by Roeder
and Haas (2016, pp. 13-31). This is currently realized mainly by performing
multiplex PCR, followed by profiling using CE. The nanopore platform offers cDNA
kits to analyze mRNA found at a crime scene. A proof-of-concept of how nanopore
sequencing can be used to perform body fluid identification was obtained by Dunn 
and Fleming (2019). They successfully identified salivary gland and tongue mRNAs 
by sequencing both direct mRNA and cDNA. 


Epigenetics and epigenomics 


An advantage of nanopore sequencing is its capability to directly distinguish methylated 
and unmethylated cytosines (Rand et al., 2017). DNA methylation analysis can be used 
to determine the cell or tissue type (Frumkin et al., 2011), but it can also be applied for 
age estimation (Park et al., 2016). Moreover, differentiation between monozygotic twins 
was obtained by analyzing the so-called epigenetic fingerprint (Lévesque et al., 2014).
Interest in analyzing the epigenome for forensic purposes is growing. This should enable
lifestyle and environmental exposure prediction, thereby providing law enforcement
services with valuable additional information on unknown perpetrators.

Nanopore sequencing offers the advantage of analyzing nucleotide modifications 
without the need for chemical or enzymatic conversions, or PCR. Multiple tools have 
been developed to decipher the methylation state of a DNA or RNA molecule. A sys- 
tematic benchmarking of these tools reveals pronounced differences in the prediction re- 
sults of methylation frequencies between tools (Yuen et al., 2021). In this research, 10 
regions of forensic interest were sequenced using a Cas9-targeted nanopore sequencing 
protocol. These regions included ancestry-informative SNPs, phenotypic-informative 
SNPs, and autosomal STRs. Although a lot of research is still required in both the accu- 
rate prediction of the methylation state and the interpretation of these data in a forensic 
context, nanopore sequencing is a promising tool for forensic epigenomic analysis. 


Challenges 


Nanopore sequencing is still a relatively new technology, as it was only introduced in 
2014 (Deamer et al., 2016). It did not take long for the forensic field to start investigating 
the potential of this new technology. In 2016, Zaaijer and colleagues reported a fast 
method for the identification of DNA samples using the MinION sequencer by 
sequencing SNPs at low coverage without any prior amplification step. A Bayesian algo- 
rithm could then compare the obtained results to a reference database, taking into ac- 
count the error rate of the sequencing platform (Zaaijer et al., 2016). A few years 
later, the focus of most research in the field shifted toward the identification of individuals 
by STR genotyping, which is still the most commonly used method for DNA profiling. 
The results obtained throughout the years by different research groups on STR genotyp- 
ing by nanopore sequencing perfectly demonstrate the improvements realized by ONT. 
Whereas the first report could only obtain partial profiles due to the very high sequencing 
error rate (Cornelis et al., 2018), single-contributor samples can now successfully be 
analyzed (Hall et al., 2021; Tytgat et al., 2021). Nevertheless, some hurdles should be 
overcome for nanopore sequencing to find its way toward routine use in the field. 
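
The Bayesian comparison against a reference database mentioned above can be illustrated with a minimal sketch. This is not the algorithm of Zaaijer et al. (2016); it assumes biallelic SNPs, a single observed base per covered SNP, a uniform prior over the database entries, and a fixed per-base error rate. The function names and the toy profiles are hypothetical.

```python
from math import prod

def read_likelihood(observed, genotype, error_rate):
    """P(observed base | genotype) for one read covering one SNP.

    genotype is a 2-character string such as "AG". The read is assumed to
    sample one of the two alleles at random and to be miscalled with
    probability error_rate, spread evenly over the three other bases.
    """
    per_allele = [
        (1 - error_rate) if observed == allele else error_rate / 3
        for allele in genotype
    ]
    return sum(per_allele) / 2

def posterior_match(observations, database, error_rate=0.1):
    """Posterior probability of each database profile given sparse reads.

    observations: dict mapping SNP id -> observed base
    database: dict mapping profile name -> dict of SNP id -> genotype
    A uniform prior over profiles is an assumption of this sketch.
    """
    likelihoods = {
        name: prod(
            read_likelihood(base, profile[snp], error_rate)
            for snp, base in observations.items()
        )
        for name, profile in database.items()
    }
    total = sum(likelihoods.values())
    return {name: lik / total for name, lik in likelihoods.items()}

# Hypothetical two-entry database and three low-coverage observations.
database = {
    "ref_A": {"rs1": "AA", "rs2": "CT", "rs3": "GG"},
    "ref_B": {"rs1": "GG", "rs2": "CC", "rs3": "AG"},
}
observations = {"rs1": "A", "rs2": "T", "rs3": "G"}
posterior = posterior_match(observations, database)
```

Even with a high assumed error rate, a handful of informative SNPs quickly concentrates the posterior on one profile, which is why low coverage can suffice for this kind of comparison.
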
The first important challenge is the raw read accuracy, which is still lower than that of
Illumina sequencing. This results in noisier genotyping data (Tytgat et al., 2020), which is
undesirable in a forensic setting where often challenging samples, e.g., low-input or 
mixture samples, are analyzed. ONT is continuously increasing the raw read accuracy 
by improving its pores, sequencing kits, and base callers. At the time of writing, the 
R10.4 version was the most recent flow cell version, with a modal accuracy above 
99% reported when used in combination with the so-called ‘Q20+’ sequencing kit.
Moreover, a high consensus accuracy for homopolymeric regions was obtained (Sereika 
et al., 2022). Other interesting approaches to increase the read accuracy focus on the 
amplification step prior to sequencing. Rolling circle amplification results in multiple 
copies of the target region within one read, allowing the construction of a consensus read
based on these separate subreads originating from the same molecule in the original sam- 
ple (Wilson et al., 2019). A second strategy is the use of unique molecular identifiers 
(UMIs), which allow for the creation of a consensus read based on all reads tagged 
with the same UMI (Karst et al., 2021). These results are promising for all possible appli- 
cations of nanopore sequencing within the field of forensics. 
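
The idea behind UMI-based consensus building can be sketched as follows. This is a simplified illustration, not the pipeline of Karst et al. (2021): it assumes that reads sharing a UMI are already aligned to each other and of equal length, and it uses a plain per-position majority vote. The function name and the `min_reads` threshold are hypothetical.

```python
from collections import Counter, defaultdict

def umi_consensus(tagged_reads, min_reads=3):
    """Build one consensus sequence per UMI by per-position majority vote.

    tagged_reads: iterable of (umi, sequence) pairs. Reads sharing a UMI are
    assumed to be pre-aligned and of equal length (a simplification).
    UMI groups with fewer than min_reads reads are skipped.
    """
    groups = defaultdict(list)
    for umi, seq in tagged_reads:
        groups[umi].append(seq)

    consensus = {}
    for umi, seqs in groups.items():
        if len(seqs) < min_reads:
            continue
        # zip(*seqs) yields one column of bases per position.
        consensus[umi] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus

reads = [
    ("AACG", "ACGTTAGC"),
    ("AACG", "ACGTTAGC"),
    ("AACG", "ACGATAGC"),  # one sequencing error at position 3
    ("TTGA", "ACGTCAGC"),  # singleton UMI, below the min_reads threshold
]
result = umi_consensus(reads)
```

The random error in the third read is outvoted by the two other reads carrying the same UMI, which is the mechanism by which consensus accuracy exceeds raw read accuracy.
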

A second challenge is the portability of the library preparation steps. Although the 
MinION device is small and perfectly portable, library preparation usually requires basic 
laboratory tools. In order to obtain a sample-in-answer-out system, ONT developed the 
VolTRAX, an automated sample and library preparation device. This device can mix reagents
in a controlled way, perform bead purification, and perform quality control of the
sample by measuring the UV absorption profile at 260 and 280 nm. According to ONT, 
future versions of the controlling software will enable PCR (VolTRAX V2 Technical 
Specification, 2021). Implementing the entire library preparation workflow, including 
PCR, will be crucial for on-site DNA profiling in a forensic setting. 


Conclusions 


Nanopore sequencing is a relatively new long-read sequencing technology. The 
MinION sequencer is a portable, low-cost device. These features offer many opportu- 
nities in the field of forensic DNA genotyping. Most research has focused on the iden- 
tification of individuals by genotyping STRs and SNPs in view of the portability of the 
system, which would enable DNA fingerprinting at the crime scene. A second interesting 
application is whole mtDNA sequencing, exploiting the long nanopore reads that can 
span an entire mitochondrial genome. Third, nanopore sequencing can be used for 
DNA methylation detection without any prior enzymatic or chemical conversion step, 
allowing tissue type determination. Other interesting applications are nonhuman foren- 
sics and body fluid identification. The raw read accuracy has improved tremendously 
because of the continued improvement of the flow cells, the sequencing kits, and the 
base-calling software. Furthermore, the VolTRAX device will enable on-site library
preparation with limited human intervention and virtually no need for other equipment. 
We believe that these innovations will push nanopore sequencing toward routine use in 
the field of forensic DNA genotyping.


Introduction to forensic DNA analysis


Forensic DNA analysis is the process of analyzing and interpreting genetic materials 
to assist in investigating the identity of a perpetrator or victim, particularly in criminal 
investigations. Forensic DNA analysis involves the use of scientific techniques to 
extract, purify, and analyze DNA from biological samples collected from suspects, 
victims, and crime scenes. The analysis can take various forms, such as DNA 
sequencing, polymerase chain reaction (PCR), and STR (short tandem repeat) anal- 
ysis. The sample types can include blood, tissue, bones, saliva, semen, hair, and other 
body fluids (Dash et al., 2020). 

Once the DNA has been extracted and purified from the sample type, it is analyzed to 
produce a DNA profile of the individual whose DNA was collected. The DNA profile is 
generated by comparing the collected sample’s genotype against a known reference sam- 
ple, usually from a suspect or a database. The profiles can then be compared to identify 
relationships between people, such as siblings or parents and children, and to determine 
the presence or absence of a suspect at the crime scene (Butler, 2010c). 

Forensic DNA analysis has played a vital role in solving crimes and implicating sus- 
pects. DNA evidence is considered the most reliable form of evidence because it is 
unique to each individual, and the technology can detect even minute quantities 
found at crime scenes. Furthermore, DNA evidence can provide conclusive evidence 
in instances where other forms of evidence may not be able to establish guilt beyond a 
reasonable doubt (Butler, 2010a). 

The advent of DNA databases has further increased the significance of forensic DNA 
analysis in criminal investigations. Databases composed of DNA profiles from people 
convicted of certain crimes or submitted voluntarily have led to the identification of 
criminals who may otherwise have gone undetected. Additionally, the inclusion of 
DNA evidence has helped exonerate individuals who were wrongly accused or convicted
of crimes (Butler, 2010b).


DNA typing techniques used in forensic DNA analysis


Capillary electrophoresis (CE) is a powerful analytical technique that has revolutionized 
DNA analysis in forensic science. The introduction of CE in the early 1990s significantly
improved the speed, sensitivity, and accuracy of DNA analysis. CE can be used to
separate and analyze complex mixtures of DNA samples in a single run, making it an effi- 
cient and cost-effective alternative to other traditional methods (Behl et al., 2021). 

CE is based on the principle of electrokinetic separation, where charged molecules are 
separated by the application of an electric field across a narrow capillary tube filled with a 
buffer solution. The migration of molecules is determined by their charge-to-mass ratio, 
size, and shape. By adjusting the conditions of the electric field and buffer solution, 
different types of DNA fragments can be separated and analyzed. 

CE is widely used in forensic DNA analysis for its ability to analyze small and 
degraded DNA samples, as well as mixed samples containing DNA from multiple sour- 
ces. It is commonly used in the following areas: 

1. STR analysis: Short Tandem Repeat (STR) analysis is a standard procedure for 
forensic DNA analysis. CE is used to separate STR alleles and obtain accurate 
genotypes. 

2. Mitochondrial DNA analysis: CE can be used to analyze mtDNA fragments, which 
can be useful in cases where DNA is degraded or present in limited quantities. 

3. Y-chromosome analysis: CE is used to analyze Y-chromosome-specific markers to 
determine the male ancestry of an individual. 

4. Mixed DNA samples: CE can help to resolve the DNA of two or more individuals in
a mixed sample to determine the contributors' profiles.


Advantages of capillary electrophoresis in forensic DNA analysis 


The use of CE in forensic DNA analysis offers several advantages: 

1. High resolution and sensitivity: CE can separate DNA fragments precisely, allowing 
for accurate identification of genotypes. 

2. Cost-effectiveness: CE reduces the cost of analysis by enabling the separation and 
detection of multiple DNA fragments in a single run (multiplex amplification). 

3. Efficient use of resources: With CE, researchers can analyze DNA samples with minimal
volumes.

4. Fast turnaround time: CE can complete DNA analysis in a shorter time than traditional
methods (Butler et al., 2004).


Limitations of capillary electrophoresis in forensic DNA analysis 


Despite its numerous advantages, CE has some limitations in forensic DNA analysis,
including:

1. Equipment costs: The cost of purchasing and maintaining the necessary equipment
can be high.

2. Sample preparation: Proper sample preparation is critical since impurities can interfere
with DNA separation and detection.

3. Inter-laboratory variation: There may be differences in CE protocols, reagents, and 
data analysis, which can lead to variations in results. 

CE is a powerful tool for forensic DNA analysis and has become an essential part of 
DNA profiling in criminal investigations. With its high-resolution separation and sensi- 
tivity, CE offers significant advantages over traditional methods, especially in cases 
involving limited or mixed DNA samples. Nevertheless, optimization of protocols and 
continual quality assurance measures are necessary to achieve accurate and reproducible 
results (Fekete & Schmitt-Kopplin, 2007). 


DNA sequencing 


Next-Generation Sequencing (NGS) technologies have revolutionized the field of geno- 
mics by enabling the rapid, accurate, and cost-effective sequencing of whole genomes. 
This has led to a plethora of applications in various fields including personalized medi- 
cine, forensics, and population genetics (Hou et al., 2013). One such application is the 
analysis of microhaplotypes, which are small clusters of closely linked single nucleotide 
polymorphisms (SNPs) that can be used for various purposes including parentage testing, 
disease diagnosis, and population genetics studies. Unlike STRs, which are conventionally
typed by CE, microhaplotypes are typically analyzed using NGS technology. This provides the
ability to analyze multiple genetic markers at once, increasing the amount of information
that can be gathered from a single DNA sample.

NGS-based microhaplotypes analysis is an emerging approach that harnesses the po- 
wer of NGS to identify and analyze specific genetic markers known as microhaplotypes. 
Microhaplotypes are short DNA fragments that vary in length from 50 to 300 base pairs 
and contain several SNPs in close proximity. Due to their small size and high level of 
variability, microhaplotypes are ideal for use in forensic, evolutionary, and population ge- 
netics studies. 

This technique involves using targeted sequencing to amplify and sequence specific 
microhaplotypes from DNA samples. This approach offers several advantages over tradi- 
tional genotyping methods, including increased resolution, sensitivity, and accuracy, as 
shown in Table 7.1. By analyzing multiple microhaplotypes at once, researchers can 
obtain a more comprehensive view of an individual’s genetic makeup and better under- 
stand the relationships between different populations. 

As sequencing technologies continue to improve and new bioinformatic tools and
standards are developed, we can expect to see even more exciting discoveries and applications
in the years ahead. These advancements in molecular biology techniques have
revolutionized the forensic science domain in the last few decades. Traditional methods
such as polymerase chain reaction (PCR)-based SNP assays and STR typing have been
widely used for forensic applications. However, several limitations associated with these
methods have led to the development of newer methods such as Next-Generation
Sequencing (NGS)-based microhaplotypes analysis. This chapter compares and
contrasts the advantages and disadvantages of NGS-based microhaplotypes analysis with
those of traditional DNA typing techniques.

Table 7.1 Comparison of NGS-based microhaplotypes analysis to traditional methods.

                  NGS-based microhaplotypes analysis    PCR-based SNP assays    STR typing
Sample amount     Low (ng range)                        High (µg range)         Low
Multiplexing      High (100-1000 markers in one run)    Low                     Low
Resolution        High                                  High                    Low
Sensitivity       High (detects minor contributors)     Low                     Low
Discrimination    High                                  Low                     High
Genomic region    Specific                              Specific                Nonspecific


Microhaplotypes as a tool for human identification 
Definition of microhaplotypes 


Human identification is an essential aspect of forensic science, and over the years, 
different molecular techniques have been employed to achieve this goal. Forensic 
DNA analysis plays a crucial role in criminal investigations and identifying missing per- 
sons; also, it is an essential tool for law enforcement agencies to identify suspects, link 
them to a crime scene, and exonerate innocent individuals. Short Tandem Repeats 
(STRs) have been the gold standard for human identification due to their multi-allelic 
nature and high variability (Butler, 2010d). 

The development of NGS technologies has revolutionized the field of genomics and 
has opened up new avenues for forensic applications. One of the promising applications 
of NGS in forensic science is the identification of individuals based on their microhaplo- 
types (Wang et al., 2015). 

However, implementing large-scale SNP assays in forensic labs still faces difficulties. 
Microhaplotypes (MH) or microhaps, consisting of nearby genetic markers, have 
emerged as a feasible and outstanding alternative to both STRs and Single Nucleotide 
Polymorphisms (SNPs) in human identification (Fan et al., 2022). 

Microhaplotypes, as a novel tool for human identification, have been gaining 
increasing popularity in the field of forensic science (Turchi et al., 2017). With their 
high levels of variability and informativeness, microhaplotypes have been shown to be 
effective in resolving complex forensic cases and have the potential to improve the effi- 
ciency and accuracy of crime investigations. This chapter explores the various applica- 
tions of microhaplotypes in forensic science, including those related to criminal 
investigations, ancestry inference, and disaster victim identification. The potential benefits
of microhaplotypes in forensic science make them a topic of great interest and importance
for researchers, forensic experts, and law enforcement professionals alike (Staadig &
Tillmar, 2021).

NGS technology has brought about a paradigm shift in genetic research, particularly in
the field of forensic genetics, where it has enabled the development of new tools for
human identification and ancestry inference. One such tool is microhaplotype analysis,
which uses genetic markers that are shorter than traditional STR markers yet
highly informative and discriminatory (Carratto et al., 2022; Turchi et al., 2017).

Microhaplotypes are short DNA fragments that are highly informative in identifying 
individuals and can be sequenced in a high-throughput manner using NGS. In the 
following paragraphs, we will discuss the utility of NGS for microhaplotypes analysis 
in human identification and explore the challenges and opportunities associated with 
this approach. 


Types of compound genetic markers in microhaplotypes 


Compound biomarkers consisting of two or more variants that occur within a small 
region can be regarded as generalized microhaplotypes, and include SNPs closely 
linked to STRs (SNP-STRs), insertion and deletion polymorphisms (indels) closely
linked to STRs (DIP-STRs), and indel polymorphisms closely linked to SNPs (DIP-SNPs).
DIP-STRs and SNP-STRs are two types of compound genetic markers in
microhaplotypes. DIP-STRs identified through massively parallel sequencing (MPS) assays can be easily genotyped
using CE platforms present in every forensic lab. Also, sequence-based information 
from SNP-STRs and DIP-STRs is important in designing allele-specific PCR 
primers, enabling the amplification of alleles from the minor DNA contributor that 
are absent in the genetic background of the major DNA source, thus eliminating 
the masking effect of the main DNA contributor (Jian et al., 2021). 


Criminal investigations 


One of the most important applications of microhaplotypes in forensic science is in crim- 
inal investigations. The high variability of microhaplotypes, coupled with their ability to 
provide reliable individual identification, makes them an excellent tool for resolving complex
cases. For instance, the analysis of microhaplotypes can allow forensic experts to
differentiate between different individuals in cases where traditional DNA profiling fails 
due to sample degradation or complex mixtures of DNA (Fan et al., 2022). Furthermore, 
microhaplotypes can help resolve cases involving related individuals, such as siblings or 
half-siblings, where traditional DNA profiling may not provide sufficient discriminatory 
power. The allelic combinations of microhaplotypes can be observed in the same ampli- 
con, even when only bi-allelic SNPs are present, providing multiple genotypes from a 
single locus. This unique feature of microhaplotypes makes them highly efficient for the
analysis of complex DNA mixtures. 


Ancestry inference 


Another area where microhaplotypes have shown great promise is in ancestry inference. 
By analyzing specific sets of microhaplotypes that are known to be associated with 
different ethnic groups, forensic experts can determine an individual’s ancestral origin 
with a high degree of accuracy. This information can then be used to aid criminal inves- 
tigations, as it can provide valuable clues about the possible identity of suspects. 


Disaster victim identification 


Microhaplotypes also hold great potential in disaster victim identification. In cases where 
traditional DNA profiling fails due to the degradation of DNA samples, microhaplotypes 
can be used to provide reliable and fast identification of victims. Furthermore, because 
microhaplotypes require relatively short DNA sequences, they can be amplified from 
degraded or damaged DNA samples, making them an ideal tool for identifying victims 
in mass disasters. 


Challenges and limitations of microhaplotypes 


Despite the advantages of NGS-based microhaplotypes, there are several challenges and 
limitations that need to be considered. One potential limitation is the need for large pop- 
ulation databases containing microhaplotype data in order to establish reliable frequency 
estimates (Standage & Mitchell, 2020). Additionally, bioinformatics tools and algorithms 
for the analysis of microhaplotype data are still undergoing development, which may 
affect the accuracy and reliability of results. 

One of the main challenges is the choice of the microhaplotype panel, which should 
be optimized for the specific research question and the target population. The panel 
should contain informative markers that allow for accurate haplotype phasing, while 
minimizing the genotyping error rate and the risk of missing variation (Kidd et al., 
2018). Therefore, there is a need for standardization and validation of the methods 
and markers used for microhaplotypes analysis. The use of standardized protocols and 
markers is critical for the reproducibility and reliability of the results. 

Another challenge is the bioinformatic processing, which can introduce biases and er- 
rors (Li et al., 2009). For example, alignment errors can lead to false-positive SNPs or 
haplotypes, while filtering criteria can exclude true variation or result in ascertainment 
bias. Therefore, it is important to apply rigorous quality control measures and to validate 
the results using independent methods. There is also a need for specialized bioinformatics 
tools and pipelines for the analysis of NGS data. The high volume of data generated by 
NGS can be overwhelming, and the development of efficient algorithms and analytical 
tools is essential for accurate and efficient analysis of microhaplotypes.

Finally, the interpretation of NGS-based microhaplotypes requires careful consideration
of the genetic and evolutionary processes that shape the genetic variation. These
processes can affect the pattern of linkage disequilibrium (LD), the distribution of allele 
frequencies, and the demographic history of the population. Therefore, it is crucial to 
integrate the genetic data with other sources of information, such as historical records, 
ecological data, and morphological traits (Bardan, 2019). 

Although the use of microhaplotypes has been shown to be effective in many cases, 
there is still a need for further testing and validation before widespread adoption in 
forensic science. 


Advantages of microhaplotypes over other DNA typing techniques 


One of the main advantages of microhaplotypes is their high level of polymorphism, 
which allows for highly informative analysis of DNA samples. The random match 
probability (RMP) for a large panel of microhaplotypes can be as low as 10^-100,
making microhaplotypes a highly reliable method for individualization. Addition- 
ally, the heterozygosity of a microhaplotype can be measured as the effective num- 
ber of alleles (Ae), with African populations having an Ae > 7.5 for some loci, and 
Native American populations having an Ae > 4.5 for a smaller panel of two dozen 
selected microhaplotypes (Kidd & Pakstis, 2022). 
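
To show how such small match probabilities arise, the following minimal sketch computes a per-locus match probability and multiplies it across loci. It assumes Hardy-Weinberg proportions within each locus and independence between loci, and it uses the average probability that two unrelated individuals share a genotype as the per-locus value. The allele frequencies and the panel size are hypothetical, not taken from any published panel.

```python
from itertools import combinations
from math import prod

def locus_match_probability(freqs):
    """Probability that two unrelated individuals share a genotype at one locus.

    freqs: allele frequencies summing to 1. Under Hardy-Weinberg proportions,
    homozygotes have frequency p^2 and heterozygotes 2pq; the match
    probability is the sum of squared genotype frequencies.
    """
    homozygous = sum((p * p) ** 2 for p in freqs)
    heterozygous = sum((2 * p * q) ** 2 for p, q in combinations(freqs, 2))
    return homozygous + heterozygous

def combined_rmp(panel):
    """Product of per-locus match probabilities (loci assumed independent)."""
    return prod(locus_match_probability(freqs) for freqs in panel)

# Hypothetical panel: 20 loci, each with four equally frequent alleles.
panel = [[0.25, 0.25, 0.25, 0.25]] * 20
single_locus = locus_match_probability(panel[0])
rmp = combined_rmp(panel)
```

Because the per-locus values multiply, each additional informative locus lowers the combined probability by roughly an order of magnitude in this example, so large panels reach extremely small values.
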

Microhaplotypes are also useful in ancestry inference, with at least 10 different ancestral
clusters being definable by microhaplotypes when analyzed using STRUCTURE. Furthermore,
the Ae for a locus is identical to the Paternity Index (PI), which measures the
informativeness of a locus in parentage testing. High Ae loci can also be useful in missing 
persons cases, where identifying a match with a close relative can help narrow down the 
search for the missing person (Kidd & Pakstis, 2022). 

In mixture deconvolution, microhaplotypes offer a significant advantage over 
traditional STRs due to their high level of polymorphism. High Ae microhaplotypes 
can allow for the near certainty of seeing multiple additional alleles in a mixture 
of two or more individuals in a DNA sample. This makes microhaplotypes highly 
useful in cases where multiple contributors are present, such as in sexual assault cases 
(Kidd & Pakstis, 2022). 

One of the main advantages of microhaplotypes in forensic DNA analysis is their ability
to reveal mixtures of multiple sources in a DNA sample. For a given number of alleles, the
probability of detecting three or more alleles in a mixture is at its maximum when all
alleles are equally frequent. The effective number of alleles (Ae), a classical population
genetics concept, converts unequal allele frequencies into the equivalent number of equally
frequent alleles, allowing microhaplotype loci to be ranked. Microhaplotypes with Ae values
>3 are highly useful in ordinary forensic practice, with 3-SNP microhaplotypes sometimes
meeting this criterion, and 4-SNP microhaplotypes exceeding it with values >4 (Tillmar &
Phillips, 2017). With as few as five loci with average Ae values slightly greater than 3.0,
the probability of detecting a mixture can exceed 95% (Kidd & Speed, 2015).
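
The relationship between Ae and mixture detection can be made concrete with a short sketch. It uses the classical definition Ae = 1 / (sum of squared allele frequencies) and, as simplifying assumptions, treats a mixture as two unrelated contributors whose four alleles are drawn independently from the population frequencies, with no allele drop-out and independent loci. The function names and example frequencies are hypothetical.

```python
from itertools import product
from math import prod

def effective_number_of_alleles(freqs):
    """Ae = 1 / sum(p_i^2): the number of equally frequent alleles that
    would give the same expected homozygosity."""
    return 1.0 / sum(p * p for p in freqs)

def prob_three_or_more_alleles(freqs):
    """Probability that a two-person mixture shows >= 3 distinct alleles.

    Enumerates all ordered draws of four alleles (two per contributor),
    assuming independent draws from the population frequencies.
    """
    total = 0.0
    for draw in product(range(len(freqs)), repeat=4):
        if len(set(draw)) >= 3:
            total += prod(freqs[i] for i in draw)
    return total

def prob_mixture_detected(panel):
    """Probability that at least one locus in the panel shows >= 3 alleles."""
    return 1.0 - prod(1.0 - prob_three_or_more_alleles(f) for f in panel)

equal = [1 / 3, 1 / 3, 1 / 3]      # Ae is 3
unequal = [0.5, 0.3, 0.2]          # same allele count, lower Ae
ae_equal = effective_number_of_alleles(equal)
ae_unequal = effective_number_of_alleles(unequal)
five_loci = prob_mixture_detected([equal] * 5)
```

With five loci at exactly Ae = 3, this simplified model gives a detection probability just under 95%, which is in line with the statement that average Ae values slightly above 3.0 are needed to exceed that level.
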

Microhaplotypes can also be highly informative in forensic identification. Using a 
panel of higher Ae microhaplotypes, it is possible to outperform the standard CODIS 
markers. 

Microhaplotypes for Biogeographical Ancestry: Using STRUCTURE, at least 10 
different ancestral clusters can be defined by microhaplotypes, making them highly infor- 
mative for biogeographical ancestry inference. 


Applications of microhaplotypes in forensic DNA analysis 


Massively parallel sequencing (MPS) technologies are now making microhaplotypes
a new type of forensic marker: a single sequence read can cover the expanse of the
microhaplotype, so these loci become phase-known (Turchi et al., 2017, 2019).
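
Because each read spans every SNP of the locus, a phased microhaplotype allele can be read directly from a single read. The sketch below illustrates this under simplifying assumptions: reads are already aligned to the amplicon so that offsets are shared, indels are ignored, and alleles supported by less than a fixed fraction of reads are treated as noise. The function name, SNP offsets, threshold, and sequences are hypothetical.

```python
from collections import Counter

def call_microhaplotypes(reads, snp_offsets, min_fraction=0.2):
    """Count phased microhaplotype alleles observed across reads.

    reads: sequences sharing the amplicon coordinate system (assumed).
    snp_offsets: 0-based positions of the SNPs within the amplicon.
    Alleles supported by fewer than min_fraction of the reads are dropped.
    """
    counts = Counter(
        "".join(read[i] for i in snp_offsets)
        for read in reads
        if len(read) > max(snp_offsets)  # read must span every SNP
    )
    total = sum(counts.values())
    return {
        hap: n for hap, n in counts.items()
        if total and n / total >= min_fraction
    }

reads = (
    ["ACGTTAGCCATG"] * 5      # allele G-A-A at offsets 2, 5, 9
    + ["ACTTTCGCCGTG"] * 4    # allele T-C-G
    + ["ACGTTCGCCATG"]        # single noisy read, G-C-A
)
alleles = call_microhaplotypes(reads, snp_offsets=[2, 5, 9])
```

No statistical phasing is needed here: the combination of bases on each read is the haplotype, which is what makes these loci phase-known.
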

Microhaplotypes are short DNA fragments that contain two or more closely linked
SNPs within a small genomic region. Together, these SNPs define multiple haplotype alleles
that can be used to differentiate individuals based on their haplotype profiles. Because the
SNPs lie so close together, recombination between them is rare, and SNPs mutate far less
often than STRs; microhaplotypes therefore offer a stable and reliable marker for human
identification. They are also useful for noninvasive prenatal paternity tests using cell-free
fetal DNA (cffDNA) (Carratto et al., 2022).


Next-generation sequencing for microhaplotypes analysis 


NGS has revolutionized the field of genomics and the way we understand genetic diversity
and evolution. It enables high-throughput sequencing of whole genomes or targeted
regions, providing a wealth of information about population dynamics, genetic diversity,
and phylogenetic relationships.

One of the most promising applications of NGS is the analysis of microhaplotypes. 
The use of NGS for microhaplotypes analysis offers several advantages over traditional 
methods. First, NGS allows for the simultaneous sequencing of thousands of microhaplo- 
types, providing a more comprehensive and informative profile for human identification. 
Second, NGS can provide higher accuracy and precision in the sequencing of microha- 
plotypes, reducing the possibility of errors and false identifications. Third, NGS can also 
identify novel and rare microhaplotypes, which may not be detected using traditional 
methods (Bruins et al., 2018). 

Microhaplotypes are short DNA fragments that are typically less than 300 base pairs 
(bp) in length and contain multiple SNP markers (Kidd et al., 2022; Kidd & Pakstis, 2022; 
Staadig & Tillmar, 2021). These SNPs are situated along the DNA molecule in a manner 
that allows them to be inherited together as a block, leading to high levels of LD. This
characteristic makes microhaplotypes ideal for forensic applications, as they can provide 
greater discrimination than traditional markers in certain populations. 

Due to the small size of microhaplotypes, they are amenable to low-coverage 
sequencing, making them cost-effective and efficient for genetic analysis. Microhaplo- 
types have several advantages over individual SNPs, including higher resolution for dis- 
tinguishing closely related individuals, greater power to detect recombination events, and 
lower genotyping error rates. In recent years, there has been growing interest in using 
microhaplotypes for forensic applications, population genetics, and medical research 
(Turchi et al., 2017). 


Workflow of NGS-based microhaplotypes analysis 


The workflow for NGS-based microhaplotypes analysis involves several key steps, 
including sample preparation, library preparation, sequencing, alignment, and data 
analysis. Each step has its considerations and challenges, which require careful attention 
to ensure optimal results. By following best practices and using appropriate methods 
and tools, researchers can obtain high-quality sequencing data and derive meaningful 
insights from their analyses. In this section, we will provide an overview of each of 
these steps and discuss some of the challenges and considerations involved in each 
step (Gandotra et al., 2020). 


Sample preparation 

The first step in the workflow is sample preparation, which involves collecting a biolog- 
ical sample from the individual of interest and extracting DNA. There are various 
methods available for DNA extraction, including manual and automated protocols. 
The choice of extraction method depends on the quality and quantity of the starting ma- 
terial, the downstream applications, and the laboratory resources. 

One of the critical considerations in sample preparation is minimizing contamination 
and preserving the integrity of the DNA sample. Contamination can lead to false-positive 
results, while degraded DNA can affect the quality of the sequencing data. To ensure 
optimal sample quality, it is essential to follow best practices, such as using sterile equip- 
ment, minimizing handling, and storing samples at the appropriate temperature and 
humidity. 


Library preparation for NGS-based microhaplotypes analysis 

Once the DNA has been extracted, the next step is library preparation, which involves 
fragmenting the DNA and attaching adapters that enable sequencing on the NGS plat- 
form. Several methods are available for DNA fragmentation, including sonication, enzy- 
matic digestion, and mechanical shearing. The choice of fragmentation method depends 
on the size and quality of the DNA sample and the downstream application. 

After fragmentation, the DNA fragments are ligated with adapters that contain specific 
barcodes that enable multiplexing of multiple samples in a single sequencing run. 
Multiplexing can increase the throughput and efficiency of the sequencing process by 
reducing the number of sequencing runs needed to analyze a large set of samples (Gan- 
dotra et al., 2020). 
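
The role of barcodes in multiplexing can be illustrated with a minimal sketch. It assumes, purely for illustration, that each read begins with an inline barcode of fixed length; on many platforms the index is read separately and demultiplexing is handled by the instrument software. The barcode sequences, sample names, and the function name demultiplex are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample, barcode_len):
    """Group reads by the sample barcode found at the start of each read.

    Assumes an inline barcode of fixed length (an illustrative simplification).
    Reads whose barcode is not recognized are collected under "unassigned".
    """
    by_sample = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        sample = barcode_to_sample.get(barcode, "unassigned")
        by_sample[sample].append(insert)
    return dict(by_sample)


# Hypothetical 4-base barcodes and toy reads pooled in one run
barcodes = {"ACGT": "sample_1", "TGCA": "sample_2"}
pooled_reads = ["ACGTGGATCC", "TGCAGGATTC", "ACGTGGATCA", "NNNNGGATCC"]
demuxed = demultiplex(pooled_reads, barcodes, barcode_len=4)
```

Exact barcode matching is used here for clarity; production tools usually tolerate a small number of mismatches in the barcode.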


Sequencing 

After library preparation, the next step is sequencing, which involves loading the libraries 
onto the NGS platform and acquiring sequence data. The choice of sequencing platform 
and sequencing depth depends on the research question, the type of analysis, and the 
available resources. For microhaplotype analysis, low-coverage sequencing is typically 
sufficient, which reduces the cost and time required for sequencing (Wang et al., 2015). 


Alignment 

After sequencing, the next step is alignment, which involves mapping the sequence reads 
back to a reference genome or a custom reference panel. Alignment is necessary to 
determine the location and identity of the SNP markers within the microhaplotypes. Several 
software programs are available for alignment, such as BWA, Bowtie, and HISAT2. The choice 
of alignment software depends on the type of analysis and the sequencing platform used 
(Li et al., 2009). 


Data analysis for NGS-based microhaplotypes analysis 

The final step in the workflow is data analysis, which involves calling and interpreting the 
SNP markers within the microhaplotypes. The analysis can be performed using various 
software programs, such as GATK, FreeBayes, and SAMtools. The analysis typically in- 
volves filtering the SNPs based on quality scores, allele frequency, and linkage disequi- 
librium. The filtered SNPs are then combined into haplotypes, which can be used for 
downstream applications, such as forensics or population genetics (Pang et al., 2020). 
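
Because a microhaplotype is short, a single read can span all of its SNPs, so the alleles observed together on one read can be read off directly as a haplotype. The following is a minimal sketch of that idea, not the procedure of any particular software package. It assumes gapless alignments given as (start, sequence) pairs with 0-based reference coordinates; the function name, the SNP positions, and the 10% reporting threshold are hypothetical.

```python
from collections import Counter


def call_microhaplotypes(aligned_reads, snp_positions, min_fraction=0.1):
    """Count haplotype strings from reads that span every SNP of a locus.

    aligned_reads: list of (start, sequence), 0-based start, no indels assumed.
    snp_positions: reference positions of the SNPs defining the microhaplotype.
    Returns (all haplotype counts, haplotypes at or above min_fraction of reads).
    """
    counts = Counter()
    first, last = min(snp_positions), max(snp_positions)
    for start, seq in aligned_reads:
        end = start + len(seq)  # exclusive end coordinate
        if start <= first and last < end:  # read covers all SNPs
            hap = "".join(seq[pos - start] for pos in snp_positions)
            counts[hap] += 1
    total = sum(counts.values())
    called = {h: c for h, c in counts.items() if total and c / total >= min_fraction}
    return counts, called


# Toy locus with three SNPs at hypothetical reference positions 2, 5, and 9
toy_reads = [
    (0, "ACGTTAGGCATT"),
    (1, "CTTTCGGCGT"),
    (0, "ACGTTAGGCATT"),
    (4, "TAGGCATT"),  # starts after the first SNP, so it is ignored
]
hap_counts, hap_called = call_microhaplotypes(toy_reads, [2, 5, 9])
```

Reads that cover only part of the locus are discarded rather than used for partial haplotypes, which keeps the phase information unambiguous at the cost of some depth.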


Validation of NGS-based microhaplotypes analysis 


Data analysis of microhaplotypes requires careful consideration of data quality, 
sequencing depth, and bioinformatic processing steps. In this section, we describe 
the workflow for data analysis of NGS-based microhaplotypes, including quality control, 
read alignment, SNP calling, haplotype reconstruction, and statistical inference. We 
also discuss the challenges and limitations of NGS-based microhaplotypes and strategies 
for validating the results (Pang et al., 2020). 


Quality control 


The first step in analyzing NGS-based microhaplotypes is to assess the quality of the raw 
sequencing data. This includes evaluating the read length, base quality, and overall 
sequencing depth. Low-quality reads should be removed, and adapter sequences should 
be trimmed to ensure accurate read alignment. In addition, technical replicates and pos- 
itive controls should be included to monitor the reproducibility and accuracy of the 
sequencing results. 
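
A read-level quality filter of the kind described above can be sketched as follows. The sketch assumes Phred+33 quality encoding and reads already parsed into (name, sequence, quality) tuples; the thresholds and function names are hypothetical and would need to be set during validation.

```python
def mean_phred(quality_string, offset=33):
    """Mean Phred score of a read, assuming Phred+33 ASCII encoding."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


def filter_reads(records, min_mean_q=20, min_length=50):
    """Keep reads that are long enough and have sufficient mean base quality.

    records: iterable of (name, sequence, quality_string) tuples.
    """
    kept = []
    for name, seq, qual in records:
        if len(seq) >= min_length and mean_phred(qual) >= min_mean_q:
            kept.append((name, seq, qual))
    return kept


# Toy records: 'I' encodes Q40 and '#' encodes Q2 under Phred+33
toy_records = [
    ("read_1", "ACGTACGT", "IIIIIIII"),  # high quality, kept
    ("read_2", "ACGTACGT", "########"),  # low quality, removed
    ("read_3", "ACG", "III"),            # too short, removed
]
kept_reads = filter_reads(toy_records, min_mean_q=20, min_length=4)
```

Adapter trimming is not shown; dedicated trimming tools are normally used for that step.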


Read alignment 


The next step is to align the cleaned reads to a reference genome or a custom-designed 
panel of microhaplotype sequences. Several popular alignment tools, such as BWA, 
Bowtie, and STAR, are available for this purpose. The choice of the alignment tool de- 
pends on the specific research question and the characteristics of the sequencing data, 
such as read length, sequencing depth, and the presence of indels. 


SNP calling 


Once the reads are aligned, the next step is to identify SNPs within the microhaplotypes. 
This can be done using various SNP callers, such as GATK, FreeBayes, and SAMtools. 
The quality of the SNP calls should be evaluated based on several criteria, such as the read 
depth, mapping quality, and allele frequency. False-positive SNPs can arise from 
sequencing errors, PCR duplicates, or rare variants. Therefore, it is important to filter 
out low-quality SNPs or those with low coverage. 
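
The filtering criteria named above (read depth, mapping quality, and allele frequency) can be combined in a simple rule, sketched below. The dictionary keys and the threshold values are assumptions made for illustration and are not the defaults of GATK, FreeBayes, or SAMtools.

```python
def passes_filters(call, min_depth=20, min_mapq=30, min_alt_fraction=0.1):
    """Return True if a candidate SNP call meets all (hypothetical) thresholds."""
    if call["depth"] == 0 or call["depth"] < min_depth:
        return False
    if call["mapq"] < min_mapq:
        return False
    alt_fraction = call["alt_reads"] / call["depth"]
    return alt_fraction >= min_alt_fraction


candidate_calls = [
    {"pos": 101, "depth": 250, "mapq": 60, "alt_reads": 120},  # well supported
    {"pos": 145, "depth": 8, "mapq": 60, "alt_reads": 4},      # too shallow
    {"pos": 163, "depth": 300, "mapq": 60, "alt_reads": 6},    # 2% of reads, likely error
    {"pos": 190, "depth": 200, "mapq": 12, "alt_reads": 90},   # poorly mapped
]
kept_positions = [c["pos"] for c in candidate_calls if passes_filters(c)]
```

Note that a fixed allele-fraction threshold suits single-source samples; in mixtures a true minor-contributor allele may fall below it, so the threshold has to be chosen with the intended application in mind.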


Haplotype reconstruction 


After SNP calling, the microhaplotypes can be phased and reconstructed into haplotypes 
using several methods, such as BEAGLE, HaploView, and SHAPEIT. Phasing refers to 
the process of determining which alleles belong to the same chromosome, while haplo- 
type reconstruction infers the sequence of the two haplotypes for each individual. The 
accuracy of haplotype reconstruction depends on the number of SNPs per microhaplo- 
type, the sequencing depth, and the LD structure of the markers. LD refers to the 
nonrandom association between nearby markers and can affect the accuracy of haplotype 
phasing. 
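
The strength of LD between two SNPs is commonly summarized by the coefficient D (the difference between the observed haplotype frequency and the product of the allele frequencies) and by r squared. The sketch below computes both from a list of phased two-SNP haplotypes; the function name and the toy data are hypothetical.

```python
from collections import Counter


def ld_d_and_r2(haplotypes):
    """Compute D and r^2 for two biallelic SNPs from phased haplotypes.

    haplotypes: list of 2-character strings, one per phased chromosome.
    The alleles of the first haplotype are used as the reference alleles.
    """
    n = len(haplotypes)
    counts = Counter(haplotypes)
    a, b = haplotypes[0][0], haplotypes[0][1]
    p_a = sum(c for h, c in counts.items() if h[0] == a) / n
    p_b = sum(c for h, c in counts.items() if h[1] == b) / n
    p_ab = counts[a + b] / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, r2


# Complete LD: only two of the four possible haplotypes are observed
d_full, r2_full = ld_d_and_r2(["AG"] * 5 + ["CT"] * 5)
# No LD: all four haplotypes are equally frequent
d_none, r2_none = ld_d_and_r2(["AG", "AT", "CG", "CT"])
```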


Statistical inference 


Once the haplotypes are reconstructed, they can be used for various downstream ana- 
lyses, such as population structure, admixture, and relatedness estimation. These analyses 
rely on statistical models that account for the uncertainty in haplotype phasing and geno- 
type calling. Several software packages, such as STRUCTURE, ADMIXTURE, and 
KING, are available for these analyses. It is important to validate the results using inde- 
pendent datasets or alternative methods, such as SNP genotyping, microsatellite typing, 
or mitochondrial DNA sequencing. 


Challenges and limitations related to data analysis 


Despite their promise, there are still challenges associated with microhaplotype analysis. 
For one, there is no consensus on how to define and select microhaplotypes (Kidd et al., 
2018). Different studies use different methods and criteria for defining microhaplotypes, 
which can lead to inconsistencies in results. Moreover, microhaplotype analysis requires 
complex statistical analyses and interpretation, which can be difficult for nonexperts. 

As with any new technology, NGS-based microhaplotype analysis faces limitations and 
challenges that must be addressed to improve the quality and reliability of the data it 
generates. 
One major challenge is the need for accurate and efficient bioinformatic pipelines to 
process and analyze the large amounts of sequencing data generated by NGS plat- 
forms. These pipelines must be able to handle errors and biases introduced by 
PCR amplification, sequencing chemistry, and other sources, while also detecting 
and filtering out artifacts such as chimeras and contaminants. Developing robust bio- 
informatic pipelines is an ongoing area of research, and several approaches have been 
proposed, including alignment-based methods, de novo assembly, and machine 
learning algorithms. 

Another challenge for NGS-based microhaplotype analysis is the need to validate and 
standardize the assays used for SNP selection and PCR amplification. There is currently 
no standardized method for selecting and validating microhaplotype assays, and different 
approaches may produce different results depending on factors such as sample size, SNP 
density, and genotyping platforms. To address this issue, several consortia and working 
groups have been formed to develop guidelines and best practices for microhaplotype 
analysis, including the MicroHapDB consortium and the Forensic Genomics Con- 
sortium (Standage & Mitchell, 2020). 


Future directions and opportunities in NGS-based microhaplotypes 
analysis 


Despite these challenges, the use of NGS for microhaplotypes analysis in human 
identification presents several opportunities. The ability to sequence thousands of 
microhaplotypes simultaneously using NGS can provide a more comprehensive 
and informative profile for human identification, which can improve the accuracy 
and reliability of the results. NGS can also identify novel and rare microhaplotypes, 
which can increase the power of discrimination and improve the resolution of the 
analysis. Additionally, NGS can provide a more efficient and cost-effective method 
for the analysis of microhaplotypes, which can be particularly useful in large-scale 
forensic investigations. 


The promises of microhaplotypes and advantages over traditional markers 


Microhaplotypes offer several advantages over traditional genetic markers, as shown in 
Table 7.1. First, they are shorter than other markers, which makes them less prone to 
degradation and more stable in degraded DNA samples. 

Second, they avoid the problems commonly seen in STR assays, such as stutter peaks/reads, 
which sometimes complicate Massively Parallel Sequencing (MPS) analysis in forensic 
genetics. Stutter peaks/reads can lead to difficulties in the identification and 
deconvolution of mixtures when the difference between the major and minor DNA donors is 
high. Stutter occurs when the polymerase enzyme responsible for replicating the DNA 
strand slips during amplification of a repeat region, typically creating a shorter 
fragment with one fewer repeat unit. This can lead to difficulties in accurately 
identifying alleles and may compromise the reliability of DNA analysis. By contrast, 
microhaplotypes do not exhibit stutter peaks/reads, making them easier to analyze and 
interpret in forensic DNA analysis (Oldoni et al., 2020; Standage & Mitchell, 2020). 

Third, they can provide higher levels of discrimination than traditional markers, 
particularly in genetically diverse populations. This is because microhaplotypes consist 
of nearby genetic markers that can provide multiple genotypes from a single locus, 
even when only biallelic SNPs are present. This provides more information from a single 
DNA sample than SNPs can offer (Carratto et al., 2022). 
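
The gain in information can be made concrete with two standard summary statistics: the effective number of alleles, Ae, computed as one over the sum of squared allele frequencies, and the random match probability (RMP), computed as the sum of squared genotype frequencies. The sketch below assumes Hardy-Weinberg proportions and uses hypothetical, evenly distributed frequencies to compare a single biallelic SNP with a microhaplotype that has four haplotype alleles.

```python
def effective_alleles(freqs):
    """Effective number of alleles: Ae = 1 / sum(p_i^2)."""
    return 1.0 / sum(p * p for p in freqs)


def random_match_probability(freqs):
    """Sum of squared genotype frequencies, assuming Hardy-Weinberg proportions."""
    rmp = 0.0
    n = len(freqs)
    for i in range(n):
        rmp += (freqs[i] ** 2) ** 2  # homozygous genotype i/i
        for j in range(i + 1, n):
            rmp += (2 * freqs[i] * freqs[j]) ** 2  # heterozygous genotype i/j
    return rmp


# Hypothetical frequencies: one SNP (2 alleles) vs. one microhaplotype (4 alleles)
snp_freqs = [0.5, 0.5]
mh_freqs = [0.25, 0.25, 0.25, 0.25]
snp_ae, snp_rmp = effective_alleles(snp_freqs), random_match_probability(snp_freqs)
mh_ae, mh_rmp = effective_alleles(mh_freqs), random_match_probability(mh_freqs)
```

With these toy values the microhaplotype doubles Ae (from 2 to 4) and lowers the single-locus RMP from 0.375 to about 0.109; real loci have uneven frequencies, so the gain varies by locus and population.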

Fourth, they are more informative, as they can provide information about both the 
maternal and paternal sides of an individual’s family tree. 

Fifth, they provide higher resolution, increased accuracy, and greater throughput. 
Microhaplotype analysis enables researchers to observe genetic variation at an 
unprecedented level of detail, providing insights into population genetics, 
phylogenetics, and forensic investigations. 

Sixth, microhaplotypes can also be used to identify individuals, genetic disorders, 
and disease susceptibility factors, which could lead to more effective medical 
treatments and personalized therapies. 

Seventh, they are highly sensitive, requiring as little as 50 pg of DNA, which is 
equivalent to the DNA found in about eight cells; this makes them well suited to 
analyzing degraded DNA samples from challenging forensic contexts (Oldoni et al., 2020). 
In addition, they can resolve two-person mixtures at ratios of up to 49:1 and mixtures 
of up to six contributors; the number of contributors can be accurately determined in 
more than 95% of the evaluated 2- to 4-person mixtures and in more than 83% of the 
2- to 6-person mixtures (Yang et al., 2022). This is an important feature in forensic 
investigations, where DNA samples are often mixtures of genetic material from multiple 
individuals. These properties make microhaplotypes an outstanding alternative to both 
SNPs and STRs in human identification, and they can also be applied to nonhuman DNA 
analysis and biogeographical ancestry cases (Pang et al., 2020). 
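
Two back-of-the-envelope calculations help to interpret these mixture figures. Since each diploid contributor carries at most two alleles per locus, the largest number of distinct alleles seen at any locus gives a lower bound on the number of contributors. Likewise, the mixture ratio indicates how many reads the minor contributor can be expected to yield at a given depth. The sketch below illustrates both; it is a simple counting argument and not the method used by Yang et al. (2022), and the loci, alleles, and depth are hypothetical.

```python
import math


def min_contributors(observed_alleles):
    """Lower bound on contributors: ceil(max distinct alleles at a locus / 2).

    observed_alleles: dict mapping locus name -> set of observed haplotype alleles.
    """
    max_alleles = max(len(alleles) for alleles in observed_alleles.values())
    return math.ceil(max_alleles / 2)


def expected_minor_reads(depth, major_parts, minor_parts):
    """Expected number of reads from the minor contributor at one locus."""
    return depth * minor_parts / (major_parts + minor_parts)


# Hypothetical mixture profile with up to five alleles at one locus
mixture = {
    "mh01": {"AG", "CT"},
    "mh02": {"TTC", "TCC", "CTC"},
    "mh03": {"GA", "GG", "AA", "AG", "TA"},
}
lower_bound = min_contributors(mixture)
minor_reads = expected_minor_reads(depth=1000, major_parts=49, minor_parts=1)
```

At a 49:1 ratio and a hypothetical depth of 1000 reads, only about 20 reads are expected from the minor contributor, which is why sequencing depth and error filtering thresholds interact so strongly in mixture analysis.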

Eighth, microhaplotypes have the potential to identify biogeographical ancestry. A 
good set of Ancestry-Informative Markers (AIMs) organized into microhaplotypes would 
provide not only ancestry information but also human identification potential. The 
major worldwide population groups (Africans, South-West Asians and Europeans, 
East-Asians, Americans, and the Pacific Islanders) can be defined with only 31 
microhaplotypes (Oldoni et al., 2020). This dual capability is particularly relevant in 
forensic investigations (Carratto et al., 2022). 


Challenges of NGS-based microhaplotypes technique 

Low coverage depth and variability 

One of the main challenges in NGS-based microhaplotypes analysis is the low coverage 
depth and variability of the data generated by the technology. This can result in missing 
data, inaccurate genotyping, and false positives or negatives. The variability of the data 
also makes it difficult to compare results across different samples or platforms. 


Read errors and artifacts 

Another challenge is the presence of read errors and artifacts that can result in incorrect 
genotyping or false positives. These errors can result from sequencing errors, PCR biases, 
or sample contamination. 


Data management and analysis 

Managing and analyzing the large amounts of data generated by NGS-based microhaplotypes 
analysis can also be challenging. This requires sophisticated computational tools and 
expertise in bioinformatics and statistical analysis. 


Improvements to NGS-based microhaplotypes analysis 

Enhancing coverage depth and consistency 

To improve the accuracy and reliability of NGS-based microhaplotypes analysis, one 
approach is to increase the coverage depth and consistency of the data generated by 
the technology. This can be achieved through various methods including sample pool- 
ing, targeted capture, and hybridization-based sequencing. 


Reducing read errors and artifacts 

To reduce the impact of read errors and artifacts on the accuracy of NGS-based micro- 
haplotypes analysis, various approaches can be used including error correction algorithms, 
quality control measures, and validation of results using independent methods. 


Developing and optimizing computational tools 

To manage and analyze the large amounts of data generated by NGS-based microhaplotypes 
analysis, advanced computational tools need to be developed and optimized. This 
requires expertise in bioinformatics, statistics, and computer science. 

There are challenges and limitations that need to be addressed to improve the ac- 
curacy and reliability of the data generated by the technology. By enhancing the 
coverage depth and consistency, reducing read errors and artifacts, and developing 
advanced computational tools, we can improve the quality and efficiency of 
NGS-based microhaplotypes analysis and enable its wider adoption in research and 
clinical settings. 


Future directions in microhaplotype analysis 


Despite the challenges, microhaplotype analysis holds great promise for the future of 
forensic genetics (Carratto et al., 2022; Turchi et al., 2017). Here are some future direc- 
tions that researchers should explore: 

1. Defining microhaplotypes: Since there is currently no consensus on how to define 
microhaplotypes, researchers should work toward establishing a standard definition 
and selection criteria for microhaplotypes (Kidd & Speed, 2015). 

2. Integrating microhaplotypes with other tools: Microhaplotype analysis could be 
combined with other next-generation sequencing-based tools, such as ancestry inference 
and phenotype prediction, to obtain even more powerful and informative results 
(Kidd et al., 2022). 

3. Automation and standardization of analysis: With advances in automation technol- 
ogy, it should be possible to automate most of the data analysis and interpretation steps 
of microhaplotype analysis. Additionally, standardization of protocols and analytical 
pipelines could help to improve the reproducibility and comparability of results across 
different laboratories. 

While the field still faces challenges such as defining microhaplotypes and interpreting 
results, advances in technology and collaboration can help drive the field forward into a 
future where microhaplotype analysis becomes a staple of forensic genetics research. 


Ethical considerations for NGS-based microhaplotypes analysis 


As with any new technology, NGS-based microhaplotypes analysis raises ethical consid- 
erations that must be taken into account to ensure the responsible and fair use of this 
powerful approach. In this section, we will explore the ethical issues surrounding 
NGS-based microhaplotypes analysis. We will examine the potential risks and benefits 
of this technology, discuss the obligations of scientists, researchers, and policymakers, 
and propose guidelines for the responsible use of NGS-based microhaplotypes analysis 
(Martinez-Martin & Magnus, 2019). 


Risks of NGS-based microhaplotypes analysis 


As with any technology, there are risks associated with this application, such as privacy 
concerns, discrimination, and the potential for misuse. The information obtained from 
NGS-based microhaplotypes analysis could be used to discriminate against individuals 
on the basis of their genetic profile, leading to potential harms such as loss of employ- 
ment, denial of insurance coverage, or social ostracism. In the wrong hands, genetic 
data could also be used for nefarious purposes, such as identity theft, fraudulent activities, 
or targeted marketing (Butler, 2023). 


Obligations of scientists, researchers, and policymakers 


Scientists, researchers, and policymakers have a moral obligation to ensure that NGS- 
based microhaplotypes analysis is used in a responsible and ethical manner. This includes 
respecting the privacy and autonomy of individuals, protecting sensitive genetic informa- 
tion from unauthorized access, and promoting equitable and fair access to genetic ser- 
vices. To achieve these goals, it is essential to establish clear guidelines and standards 
for the collection, storage, and use of genetic data, as well as to educate individuals about 
the risks and benefits of genetic testing (Davey, 2014). 


Guidelines for the responsible use of NGS-based microhaplotypes analysis 


To ensure the responsible and ethical use of NGS-based microhaplotypes analysis, we 
propose the following guidelines: 

1. Informed consent: Individuals must provide informed consent before any genetic 
testing is conducted, and they should be provided with clear and comprehensive in- 
formation about the risks and benefits of the testing, as well as their rights to privacy 
and confidentiality. 

2. Privacy and confidentiality: Genetic data must be kept confidential and protected 
from unauthorized access. Appropriate security measures must be implemented to 
safeguard sensitive information. 

3. Data sharing: Genetic data should be shared only for legitimate research purposes, and 
proper protocols and safeguards should be implemented to minimize the risk of 
misuse. 

4. Nondiscrimination: Genetic information must not be used to discriminate against in- 
dividuals on the basis of their genetic profile, and appropriate legal protections should 
be put in place to prevent such discrimination. 

5. Public education: The public must be educated about the risks and benefits of genetic 
testing, including the importance of protecting privacy and confidentiality. 

To ensure that this technology is used in a responsible and ethical manner, it is essen- 
tial to address the potential risks and benefits, and to establish clear guidelines and stan- 
dards for the collection, storage, and use of genetic data. With careful consideration and 
responsible practices, NGS-based microhaplotypes analysis can contribute to scientific 
progress while respecting the privacy and autonomy of individuals (National DNA Data- 
base Ethics Group, 2017). 


Potential applications of NGS-based microhaplotypes analysis 


One of the key applications of NGS-based microhaplotypes analysis is in forensic science. 
Microhaplotypes have been shown to be highly informative for individual identification, 
making them valuable tools for criminal investigations and paternity testing. In addition, 
microhaplotypes can be used to determine the ancestry of an individual, which can provide 
important contextual information for forensic investigations. 

NGS-based microhaplotypes analysis also has important implications for evolutionary 
and population genetics research. By analyzing microhaplotypes from multiple individ- 
uals or populations, researchers can gain insights into patterns of genetic diversity and 
identify regions of the genome that are under selective pressure. This information can 
be used to better understand the genetic basis of complex traits and diseases and to 
develop targeted approaches for disease diagnosis and treatment. 

Microhaplotypes have emerged as a promising tool for identifying biogeographical 
ancestry. Ancestry informative markers (AIMs) are genetic markers that vary in frequency 
between different populations and can be used to infer an individual’s ancestry. Tradi- 
tionally, AIMs have been identified using SNPs or STRs, but microhaplotypes offer a 
new alternative for the identification of ancestry. A good set of AIMs organized into 
microhaplotypes would grant not only ancestry information but also human identifying 
potential. 

Several studies have shown that microhaplotypes can accurately identify major pop- 
ulation groups. In a study by Oldoni et al. (2019), the authors identified 31 microhaplo- 
types that could define the major worldwide population groups, including Africans, 
South-West Asians and Europeans, East-Asians, Americans, and Pacific Islanders. These 
microhaplotypes were able to correctly assign individuals to their respective population 
group with an accuracy of over 99%. 
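
One simple way to picture population assignment is a likelihood comparison: for each candidate population, the probability of the observed multi-locus genotype is computed from that population's haplotype frequencies, and the population with the highest likelihood is chosen. The sketch below assumes Hardy-Weinberg proportions and independence between loci. It is not the method of the study cited above, and all locus names, haplotypes, and frequencies are invented for illustration.

```python
import math


def log_likelihood(genotype, pop_freqs, floor=1e-3):
    """Log-likelihood of a multi-locus genotype in one population.

    genotype: dict locus -> (haplotype1, haplotype2).
    pop_freqs: dict locus -> dict haplotype -> frequency.
    floor: small frequency used for haplotypes unseen in the reference data.
    """
    total = 0.0
    for locus, (h1, h2) in genotype.items():
        p1 = pop_freqs[locus].get(h1, floor)
        p2 = pop_freqs[locus].get(h2, floor)
        geno_freq = p1 * p2 if h1 == h2 else 2 * p1 * p2
        total += math.log(geno_freq)
    return total


def assign_population(genotype, freqs_by_pop):
    """Return the best-scoring population and the per-population scores."""
    scores = {pop: log_likelihood(genotype, f) for pop, f in freqs_by_pop.items()}
    return max(scores, key=scores.get), scores


# Invented reference frequencies for two populations at two microhaplotype loci
reference = {
    "pop_A": {"mh1": {"AG": 0.8, "CT": 0.2}, "mh2": {"TT": 0.7, "CA": 0.3}},
    "pop_B": {"mh1": {"AG": 0.1, "CT": 0.9}, "mh2": {"TT": 0.2, "CA": 0.8}},
}
person = {"mh1": ("AG", "AG"), "mh2": ("TT", "CA")}
best_pop, pop_scores = assign_population(person, reference)
```

In casework the likelihoods are usually reported as ratios with an assessment of uncertainty rather than as a single best label, particularly for individuals of mixed ancestry.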

The ability of microhaplotypes to identify biogeographical ancestry can have impor- 
tant applications in forensic investigations. For example, in cases where the identity of an 
individual is unknown, analysis of their DNA could provide information about their 
likely ancestry. This information could be used to narrow down the pool of potential 
suspects and help focus the investigation. 

Microhaplotypes can also be useful in cases where human remains are found and there 
are no immediate clues to the identity of the individual. Analysis of the DNA from the 
remains could provide information about their ancestry, which could be used to help 
identify the individual. Additionally, the use of microhaplotypes could be particularly 
useful in cases where the individual’s ancestry is not immediately clear, such as cases 
involving individuals with mixed ancestry. 

Furthermore, microhaplotypes can also be used in combination with other forensic 
DNA analysis techniques, such as STRs, to increase the power of human identification. 
In a study by Zhu et al. (2019), the authors demonstrated that a combination of 28 
microhaplotypes and 20 STRs provided high discriminatory power and ancestry information 
for a Chinese population sample. This combined approach could provide greater accu- 
racy in identifying individuals and could be particularly useful in cases where traditional 
methods of identification have failed. 


Future directions of NGS-based microhaplotypes analysis 


In this section, we will explore the current state-of-the-art in NGS-based microhaplo- 
type analysis, highlight key challenges and opportunities, and outline future directions 
for research in this exciting field. 


NGS-based microhaplotype analysis: state-of-the-art 


The first step in NGS-based microhaplotype analysis is the selection of informative SNPs 
that are suitable for multiplex PCR amplification and subsequent sequencing. Several 
methods have been developed for SNP selection, ranging from genome-wide association 
studies (GWAS) to targeted panels of SNPs based on functional annotation or population- 
specific frequencies. The most efficient approach for SNP selection is likely to involve a 
combination of these methods, taking into account factors such as linkage disequilibrium, 
allele frequency, and quality control metrics. Once informative SNPs have been identi- 
fied, they can be amplified by multiplex PCR, and the resulting amplicons can be 
sequenced using NGS platforms such as Illumina or PacBio (Kidd & Pakstis, 2022). 

NGS-based microhaplotype analysis offers several advantages over traditional 
methods such as capillary electrophoresis and Sanger sequencing. First, NGS enables 
the analysis of thousands or even millions of individual readouts per sample, providing 
much higher resolution than previous methods. Second, NGS allows for the detection 
of rare variants and low-frequency alleles that may be missed by other techniques. Third, 
NGS-based microhaplotype analysis requires relatively small amounts of DNA, making it 
feasible to use degraded or low-quality samples such as those obtained from forensic in- 
vestigations or archeological remains. 


Opportunities for future research in NGS-based microhaplotype analysis 


In a study by Staadig and Tillmar (2021), the authors evaluated the performance of a custom-made 
QIAseq Microhaplotype panel that includes 45 different microhaplotype loci. The panel 
was evaluated using 75 samples of Swedish origin, and the haplotype frequencies were 
established. The authors also conducted sensitivity studies to determine the limit of 
detection of haplotypes and found that they could detect haplotypes at input amounts 
down to 0.8 ng. Mixture samples were also tested, and haplotypes for the minor contrib- 
utor were detectable down to the level of 1:100. This demonstrates the high sensitivity of 
microhaplotypes and their potential use in mixture analysis, which is a common problem 
in forensic practice. Furthermore, the authors performed kinship simulations to evaluate 
the usefulness of the panel in kinship analysis. The results showed that both paternity and 
full sibling cases can clearly be solved. The authors noted that there were some primer 
design issues that could be optimized to increase the power of the assay. Nonetheless, 
the results of this study are promising for further implementation of this microhaplotype 
assay into the forensic field (Staadig & Tillmar, 2021). 

Finally, there are several opportunities for future research in NGS-based microhaplotype 
analysis. One promising area involves the use of microhaplotypes for population genetics 
and evolutionary studies. By analyzing microhaplotype variation in large and 
diverse populations, researchers can gain insights into the evolutionary history and migra- 
tion patterns of human and animal species, as well as identify regions of the genome un- 
der selection. Another area of research involves the development of new sequencing 
technologies and platforms that can provide even higher resolution and accuracy than 
current methods. For example, nanopore sequencing and single-molecule sequencing 
technologies offer the potential for real-time, long-read sequencing of individual mole- 
cules, enabling the detection of structural variants, haplotype phasing, and epigenetic 
modifications. 


Conclusion 


In conclusion, next-generation sequencing has revolutionized the field of genomics by 
providing researchers with the ability to sequence large amounts of genetic material 
quickly and accurately. 

Microhaplotypes have emerged as a feasible alternative to both STRs and SNPs in 
forensic DNA analysis. They offer advantages in sensitivity, exclusion power, and 
discrimination power. A thorough selection of microhaplotypes can establish a single 
MPS assay to deal with a variety of forensic DNA analysis needs. Additionally, compound 
genetic markers such as DIP-STRs and SNP-STRs are gaining attention in microhaplotype 
analysis, making them promising tools for forensic DNA analysis. Their high level 
of polymorphism allows for highly informative analysis of DNA samples, with a low 
random match probability (RMP) and a high effective number of alleles (Ae) in certain 
populations. 

As technology advances and more research is conducted in this field, it is likely that 
microhaplotypes will become even more important in forensic DNA analysis. 

Microhaplotypes have shown great potential in identifying biogeographical ancestry. 
They offer a new alternative to traditional methods of identifying AIMs and can accu- 
rately define major worldwide population groups with a small set of markers. Microha- 
plotypes have important applications in forensic investigations and can be used in 
combination with other DNA analysis techniques to increase the accuracy of human 
identification. 

However, there are still some limitations and challenges to the use of microhaplo- 
types in forensic DNA analysis. Further studies are needed to explore the potential of 
microhaplotypes in various aspects of forensic DNA analysis, such as mixture interpre- 
tation, quality control, and validation of new assays. Additionally, there are issues 
regarding data interpretation, privacy concerns, and ethical considerations that should 
be addressed. 


Overview of the concept of population study 


Population study, also referred to as population genetics study, constitutes a subfield of 
genetics that employs mathematical and statistical methods to study the variations of 
gene frequencies and genotype frequencies in biological populations, scrutinizes the pop- 
ulation genetic structures and relationships, and explores the population genetic back- 
grounds. Moreover, population studies, especially population genetics studies, play a 
pivotal role in common forensic genetics applications, including individual identification, 
paternity testing, and ancestry inference. Here, we focus on the systematic study of 
population genetics in forensic genetics and introduce the basic concepts of a 
population study, the theoretical basis of genetics, the relevant parameters of 
population genetics, and the corresponding forensic genetics applications. 


The concept of population study 


The term “population” generally refers to all individuals of the same species, such as all 
living human beings on the earth. The “population” mentioned in this chapter specifically 
refers to the human population in the context of forensic applications. The gene 
pool is defined as the entire collection of genes shared by individuals within a freely 
mating population. The genetic structure of a population is determined by the frequencies 
of various genes and the distributions of genotypes generated from different mating 
patterns. Each individual in the population inherits alleles from their biological parents 
and subsequently passes these alleles on to their offspring. As genes and genotypes pass 
down through generations, their frequencies in the population are influenced by various 
evolutionary forces, which eventually lead to variations of allele polymorphisms in the 
gene pool. The evolutionary forces that affect the genetic diversity of the gene pool 
include mutation, genetic drift, natural selection, environmental diversity, migration, 
and nonrandom mating patterns. 
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In order to investigate the distributions of gene frequencies and genotype frequencies 
among populations, population study applying mathematical and statistical methods is 
able to investigate the genetic divergence of gene frequencies and genotype frequencies 
among and within the populations. Exploring the interrelations between genetic struc- 
ture variations and factors of evolutionary forces is of great importance in population ge- 
netics; the latter comprises environmental selection, genetic mutation, migration, and 
genetic drift, etc. The subjects of forensic population study are Mendelian populations, 
the relatively larger populations of the same species which propagate sexually, where in- 
dividuals therein mate randomly and follow Mendel’s laws of inheritance. 


Genetic theories of population study 


Population genetics is essentially a combination of genetics and evolutionary theory. On
the genetics side, classical population genetics builds on fundamental principles such as
Mendel’s laws of segregation and independent assortment and Morgan’s law of linkage and
crossing-over, and it aims to track dynamic changes in populations in order to make more
precise predictions about how specific populations develop. On the evolutionary side,
scientists such as R. A. Fisher, J. B. S. Haldane, and Sewall Wright developed and refined
the “modern evolutionary synthesis” on the basis of Darwinian theory and mathematical
statistics in the early 20th century (Servedio et al. 2014). After integrating mathematical
statistical models such as the Hardy-Weinberg equilibrium law and the Wright-Fisher
model, classical population genetics matured into a unified discipline of genetics and
statistics. Since the dawn of molecular biology in the mid-to-late 20th century, scientists
have found a myriad of genetic markers with different levels of polymorphism that could
not be explained by the evolutionary theories of the time. Kimura (1968) therefore
introduced the “neutral theory of molecular evolution”, which appears to stand in
opposition to the modern evolutionary synthesis built on the theory of natural selection.
However, the two views have compensated for each other’s theoretical weaknesses and
together form the basis of modern population genetics.


Forensic applications of population study 


Forensic genetics, an indispensable branch of forensic science, has developed rapidly and
been widely used since its inception. With the continuous expansion of its applications
and scope of examination, it has become one of the most common and effective
technologies for criminal investigations. However, increasingly sophisticated criminal
methods have led to a higher proportion of high-tech crimes, which places more stringent
demands on the in-depth mining of biological evidence to address forensic issues such as
individual identification, paternity testing, and ancestry inference. Forensic population
studies mainly focus on exploring polymorphic genetic markers in particular populations,
calculating a series of forensic parameters for those markers from population genetic
structure information (allele frequencies and genotype frequencies), and eventually
constructing genetic marker systems that are qualified for forensic individual
identification and paternity testing in different populations. In addition, forensic
population studies survey the genetic background of the target population and its genetic
relationships with other populations by analyzing population genetic structure data with
statistical methods. The resulting allele and genotype frequencies of the target population
lay the foundation for calculating forensic parameters such as the likelihood ratio and
paternity index (PI) used in individual identification and paternity testing.

Forensic genetics requires large amounts of genetic marker data from different
populations as a basis. Currently, many population studies have been carried out on both
the traditional capillary electrophoresis (CE) platform and the newer next-generation
sequencing (NGS) platform. As the mainstream platform in current criminal
investigations, the CE platform detects the genotypes of targeted loci from differences in
fluorescent dye labels and allelic fragment lengths, and thereby obtains the length
polymorphism information of genetic markers. Compared with the CE platform, the NGS
platform can access not only the length polymorphism information of conventional
genetic markers but also their precise sequence information, which gives a more
comprehensive picture of the genetic information of targeted markers and a more
comprehensive evaluation of the forensic efficacy of different genetic marker systems.
Three main research topics in NGS-based forensic population studies remain open. First,
how to take full advantage of the high throughput of the NGS platform, applying more
genetic markers of more types to analyze the genetic background of the target population
while simultaneously evaluating the forensic utility of multiple genetic markers and the
related population genetics data. Second, how to explore the genetic structures,
backgrounds, and relationships among different populations more effectively through
comparative analysis of genetic data from various populations. Third, how to investigate
the independence of genetic markers of one or more types, in the same or different panels,
based on DNA profiling results from multiple genetic marker systems; how to assess the
feasibility and efficacy of combining multiple panels with different types of genetic
markers; and how to develop statistical methods for analyzing the forensic applicability of
multiplex systems that use one or more types of genetic markers, based on large-scale data.


Forensic parameters of population genetics 


Candidate molecular genetic markers, or detection systems built from them, need to be
evaluated for system performance before they are used in forensic practice. The
evaluation mainly includes three interrelated aspects. The first is to evaluate the
population applicability of the candidate molecular genetic markers, that is, whether the
markers conform to Hardy-Weinberg equilibrium and are in a state of linkage equilibrium
in the target population. The second is to evaluate the polymorphisms of the candidate
molecular genetic markers, that is, to obtain their genetic polymorphism parameters
through a population survey. The purpose of this evaluation is to obtain molecular
markers with higher levels of polymorphism and thereby achieve higher efficiency in
forensic applications such as individual identification and paternity testing. The third is
to identify the genetic relationships among populations and to build statistical analysis
methods for forensic applications, such as kinship identification and ancestry inference,
based on the constructed system.


Population applicability of candidate molecular genetic markers 
Hardy-Weinberg equilibrium test 

Hardy-Weinberg equilibrium, also known as the Hardy-Weinberg rule, is a fundamental
principle of population genetics and the basis of an important statistical test. It was
described in 1908 by G. H. Hardy, an English mathematician, and Wilhelm Weinberg, a
German physician, and the theorem is named after them. The theorem asserts that under
ideal conditions, the gene frequencies and genotype frequencies of a population remain
unchanged from generation to generation, and it clarifies the effect of reproduction on the
gene and genotype frequencies of a population. The Hardy-Weinberg equilibrium law is
significant in two main ways: first, it describes the relationship between gene frequency
and genotype frequency; second, it is used to evaluate whether the result of a sampling
survey is representative of the target population, and hence to evaluate the reliability
of the population research results. The law requires five ideal conditions to be satisfied:
an infinite population; random mating among individuals within the population; no
mutation; no natural selection; and no migration. Only when the candidate genetic
markers are in Hardy-Weinberg equilibrium in the applied population can statistical
analyses and calculations of the likelihood ratio and PI based on the polymorphism data of
these markers be reliable and accurate enough to serve forensic individual identification
and kinship identification. Therefore, for new candidate genetic markers, it is necessary
to assess whether they are in Hardy-Weinberg equilibrium in the studied population.
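
One simple way to carry out such an assessment is a chi-square goodness-of-fit test that compares observed genotype counts with the counts expected from the allele frequencies. The sketch below is illustrative only: the function name `hwe_chi_square` and the input format (one pair of allele labels per individual) are assumptions, and the chapter does not prescribe a particular test. For multiallelic loci with rare genotypes, exact tests are often preferred.

```python
from collections import Counter
from itertools import combinations_with_replacement


def hwe_chi_square(genotypes):
    """Chi-square statistic and degrees of freedom for a Hardy-Weinberg test.

    `genotypes` is a list of 2-tuples of allele labels, one tuple per individual
    (a hypothetical input format chosen for this sketch).
    """
    n = len(genotypes)
    # Treat ("A", "B") and ("B", "A") as the same unordered genotype.
    observed = Counter(tuple(sorted(g)) for g in genotypes)
    # Allele frequencies by gene counting: every individual contributes two alleles.
    allele_counts = Counter(allele for g in genotypes for allele in g)
    freq = {a: c / (2 * n) for a, c in allele_counts.items()}
    alleles = sorted(freq)

    chi2 = 0.0
    for a, b in combinations_with_replacement(alleles, 2):
        # Expected proportion is p^2 for homozygotes and 2pq for heterozygotes.
        proportion = freq[a] ** 2 if a == b else 2 * freq[a] * freq[b]
        expected = proportion * n
        chi2 += (observed.get((a, b), 0) - expected) ** 2 / expected

    k = len(alleles)
    # k(k+1)/2 genotype classes, minus 1, minus (k - 1) estimated allele frequencies.
    dof = k * (k - 1) // 2
    return chi2, dof


# Invented example data: genotype counts that match p = 0.6, q = 0.4 exactly,
# and a sample with no heterozygotes at all.
balanced = [("A", "A")] * 36 + [("A", "B")] * 48 + [("B", "B")] * 16
no_hets = [("A", "A")] * 50 + [("B", "B")] * 50
```

For the `balanced` sample the statistic is essentially zero, while the `no_hets` sample gives a statistic of 100 with one degree of freedom, far above the 0.05 critical value of 3.841, so equilibrium would be rejected.
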


Linkage and linkage disequilibrium 
(1) Linkage 

Linkage refers to the tendency of genes that are physically close to each other on the
same chromosome to be inherited together during meiosis in sexual reproduction. During
meiosis, non-sister chromatids cross over, so genetic markers that are physically closer
together are more tightly linked than those farther apart. In other words, the closer two
loci on a chromosome are, the lower the chance of recombination between them, and the
more likely it is that they are co-inherited. The microhaplotype (MH) genetic marker is an
example: it is a composite marker made up of multiple SNPs within a range of about
300 bp, and its SNPs are inherited in linkage. Genetic markers on different chromosomes
are generally considered to be unlinked. On autosomes, genetic markers on the same
chromosome that are more than 10 Mb apart are also generally treated as unlinked. When
genes or loci combine randomly during the transmission of genetic material, they are in a
state of linkage equilibrium. Unlinked loci in linkage equilibrium are independent of each
other, so the product rule can be applied across the independent markers to calculate the
cumulative probability for a system of multiple loci.

(2) Linkage disequilibrium 

Linkage disequilibrium refers to nonrandom association, in which the probability that
alleles at two or more loci occur together differs from the probability expected if they
combined at random in a population. For example, the alleles at the loci of the human
leukocyte antigen genes do not combine completely at random to form haplotypes, and
the observed haplotype frequencies in population data differ from the expected
frequencies. Loci in linkage disequilibrium may be located in different regions of the same
chromosome or on different chromosomes. Factors that influence linkage disequilibrium
between pairs of loci include natural selection, genetic drift, mutation, recombination,
and linkage.
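
The gap between observed and expected haplotype frequencies can be quantified. The sketch below computes three widely used pairwise measures for two biallelic loci: the coefficient D, its normalized form D', and the squared correlation r². These measures are standard in population genetics but are not named in this chapter, and the function name and arguments are assumptions made for illustration.

```python
def linkage_disequilibrium(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab: observed frequency of the haplotype carrying allele A and allele B
    p_a:  frequency of allele A at the first locus
    p_b:  frequency of allele B at the second locus
    """
    # D is the observed haplotype frequency minus the frequency expected
    # under random combination of the two alleles.
    d = p_ab - p_a * p_b
    # The largest |D| that the allele frequencies allow depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2
```

With `p_a = p_b = 0.5`, a haplotype frequency of 0.25 gives D = 0 (linkage equilibrium), whereas a haplotype frequency of 0.5 gives D' = 1 and r² = 1 (complete association).
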

Linkage and linkage disequilibrium are related but distinct. Linkage depends on the
physical distance between loci on a chromosome, is sometimes also related to the
inheritance pattern of a disease, and can be tested in pedigree data. Linkage
disequilibrium reflects the non-independent co-occurrence of alleles at different loci; it
describes associations between alleles and can be tested in both pedigree and population
genetic study data. When two genetic markers are in linkage or linkage disequilibrium,
haplotype frequencies and related parameters must be used to describe and evaluate
the genetic structure of the population.


Genetic markers located on sex chromosomes 

Humans have a pair of sex chromosomes, XX in females and XY in males. Sex chromosomes
also carry various types of polymorphic DNA genetic markers, such as short tandem
repeats (STRs), SNPs, and InDels. A mother passes either of her two X chromosomes to
her offspring at random, so the inheritance pattern is similar to that of the autosomes.
When male sex chromosomes undergo meiosis, the pseudoautosomal regions at both ends
of the Y chromosome, which make up approximately 5% of its total length, can exchange
and recombine with the homologous segments of the X chromosome, and the inheritance
pattern of genetic material in these segments is similar to that of the autosomes. The
remaining 95% of the Y chromosome (the male-specific region, MSY) behaves differently:
it is transmitted from father to son as a haplotype. Likewise, the X-specific region of the
father’s X chromosome is passed on to his daughters as a haplotype. Because of this
special transmission pattern, sex chromosomal genetic markers are particularly
advantageous in the forensic identification of special cases and complex kinship
relationships.

Because of their distinct inheritance patterns, sex chromosomal genetic markers are
handled differently from autosomal markers when testing for Hardy-Weinberg equilibrium
and linkage disequilibrium. For X-chromosome genetic markers, the Hardy-Weinberg test
first examines whether the gene frequencies of the markers differ significantly between
the sexes. If they do not, and the genotypes of female individuals conform to the
Hardy-Weinberg rule, the X-chromosome markers are considered to be in Hardy-Weinberg
equilibrium in the target population. Genetic markers in the Y-chromosome-specific
region are inherited in linkage and passed on to male offspring as a whole, that is, as a
Y-chromosome haplotype. X-chromosome genetic markers are distributed along the
153 Mb X chromosome, at a density comparable to that of the autosomes. When two
X-chromosome genetic markers are physically close, the possibility of linked inheritance
should also be considered. Clusters of closely linked genetic markers, called linkage
groups, should be treated as haplotypes in statistical analysis.


Polymorphism parameters of genetic markers 


Alleles composed of different sequences among different individuals or between two 
different DNA copies of the same individual form a DNA polymorphism. According 
to the structural characteristics of DNA genetic markers, DNA polymorphisms can be 
divided into two types: length polymorphisms and sequence polymorphisms. Length 
polymorphism refers to a polymorphism composed of DNA fragment length differences 
among alleles at the same locus. Currently, STR and InDel are widely used genetic 
markers of length polymorphisms. A sequence polymorphism refers to a polymorphism
consisting of one or more base differences in the DNA sequences of different
individuals or between two DNA copies at the same locus of the same individual. SNPs
are among the most common genetic markers of sequence polymorphisms and exist on
both nuclear chromosomes and mitochondrial DNA.

Common parameters for evaluating genetic marker polymorphism include gene frequency,
genotype frequency, haplotype frequency, heterozygosity, and polymorphism information
content (PIC). The polymorphism parameters of each marker can be obtained through a
population survey, provided that the surveyed population is large enough. According to
the recommendations of the International Society for Forensic Genetics (Gusmao et al.,
2017), population genetic studies based on autosomal STRs (A-STR), X chromosome STRs
(X-STR), autosomal SNPs/InDels (A-SNP/InDel), and X chromosome SNPs/InDels
(X-SNP/InDel) should include more than 500 samples; Y chromosome STR (Y-STR)
studies, more than 400 individuals; Y chromosome SNP/InDel (Y-SNP/InDel) studies,
more than 300 individuals; Sanger sequencing of the mitochondrial DNA (mtDNA) control
region (Full mtDNA CR_Sanger), more than 200 samples; Sanger sequencing of the full
mitochondrial genome (Full mtDNA molecule_Sanger), more than 100 samples; detection
of SNPs in the mtDNA coding region (mtDNA_SNPs in coding region), more than 200
samples; and population studies based on NGS platforms, more than 50 individuals.


Allele frequency and genotype frequency 

The specific position of a gene on a chromosome is called a locus, and genes that occupy
the same position on a pair of homologous chromosomes but differ in the primary
structure of their DNA are called alleles. When three or more alleles are possible at one
locus in a population, they are called multiple alleles, as at STR loci and the ABO blood
group gene. When there are only two alleles at a locus in the population, it is considered
a biallelic locus, such as SNPs and InDels.

Allele frequency refers to the ratio of the number of copies of a certain allele at a
particular locus to the total number of alleles at the locus, and it is a basic measure for
evaluating the genetic structure of a population. The frequencies of all alleles at a given
locus sum to 1. Genotype frequency refers to the ratio of the number of individuals with a
specific genotype to the total number of individuals in a population, and the frequencies
of all genotypes at the locus also sum to 1. The genetic structure of a population is
represented by the frequencies of all alleles and of the various genotypes in the
population. If a locus has two or more alleles in a population, each with a frequency above
0.01, the locus is considered genetically polymorphic in that population.
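
The two definitions above translate directly into gene counting. The following sketch is illustrative: the function name, the input format (one pair of allele labels per individual), and the four invented STR genotypes are assumptions for the example, not data from this chapter.

```python
from collections import Counter


def allele_and_genotype_frequencies(genotypes):
    """Return (allele_freqs, genotype_freqs) for one autosomal locus.

    `genotypes` is a list of 2-tuples of allele labels, one per individual.
    """
    n = len(genotypes)
    # Unordered genotypes: (12, 13) and (13, 12) are counted together.
    genotype_counts = Counter(tuple(sorted(g)) for g in genotypes)
    # Each diploid individual contributes two alleles, so the denominator is 2n.
    allele_counts = Counter(allele for g in genotypes for allele in g)
    allele_freqs = {a: c / (2 * n) for a, c in allele_counts.items()}
    genotype_freqs = {g: c / n for g, c in genotype_counts.items()}
    return allele_freqs, genotype_freqs


# Invented STR genotypes (repeat numbers) for four individuals.
sample = [(12, 13), (12, 12), (13, 14), (12, 14)]
allele_freqs, genotype_freqs = allele_and_genotype_frequencies(sample)
# allele_freqs   -> {12: 0.5, 13: 0.25, 14: 0.25}
# genotype_freqs -> each of the four observed genotypes has frequency 0.25
```

Both sets of frequencies sum to 1, as stated in the text.
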


Haplotype frequency 

When two loci are associated or linked, they are not independent, and the multiplication
rule of probability cannot be used to calculate parameters such as the cumulative
individual identification probability. In such cases, haplotype frequencies are needed to
describe the genetic structure of the population or to calculate forensically relevant
probabilities. A haplotype is the combination of alleles at linked loci, and the alleles that
make up a haplotype are inherited together. The haplotypes of autosomal genetic markers
need to be determined by family survey, because when two linked loci are both
heterozygous, it is impossible to tell from an individual’s typing results alone which
alleles are inherited together. For maternally inherited mtDNA and paternally inherited
Y-chromosome genetic markers, haplotypes can be determined from individual typing
results, and their frequencies can be calculated directly by the counting method.


Heterozygosity (He) and polymorphism information content

Heterozygosity is a traditional genetic parameter that refers to the proportion of
heterozygotes among all genotypes for a genetic marker in a population. The more alleles
a genetic marker has and the more evenly their frequencies are distributed, the higher the
proportion of heterozygotes among all genotypes. The higher the heterozygosity, the
greater the efficiency of the marker in forensic personal identification. Since males have
only one X chromosome, only female individuals are considered when assessing the
heterozygosity of X chromosome genetic markers. Heterozygosity is usually expressed as
H or h, where h is the observed value of heterozygosity. The formula is as follows:


$$h = \frac{\text{number of heterozygotes}}{\text{total number of individuals}} \tag{8.1}$$


H is the expected value of heterozygosity, and the calculation formula is as follows:


$$H = \frac{n}{n-1}\left(1 - \sum_{i=1}^{k} p_i^2\right) \tag{8.2}$$


In the formula, n is the number of samples, k is the number of alleles, and $p_i$ represents
the frequency of the ith allele of a genetic marker.
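
Equations 8.1 and 8.2 can be sketched in a few lines. The function names and the example data are assumptions made for illustration. The code takes n as the number of samples, following the definition given above; some references instead use the number of sampled gene copies in the correction factor.

```python
def observed_heterozygosity(genotypes):
    """Equation 8.1: share of individuals whose two alleles differ."""
    heterozygotes = sum(1 for a, b in genotypes if a != b)
    return heterozygotes / len(genotypes)


def expected_heterozygosity(allele_freqs, n):
    """Equation 8.2: n / (n - 1) * (1 - sum of squared allele frequencies)."""
    return n / (n - 1) * (1 - sum(p * p for p in allele_freqs))


# Invented example: four individuals typed at one STR locus.
sample = [(12, 13), (12, 12), (13, 14), (12, 14)]
h_obs = observed_heterozygosity(sample)                      # 3 of 4 -> 0.75
h_exp = expected_heterozygosity([0.5, 0.25, 0.25], n=4)      # 4/3 * 0.625
```
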

PIC is a quantitative indicator for evaluating the usefulness of a polymorphic locus. The
value of PIC can be calculated from the number of alleles of a genetic marker and the
distribution of their frequencies. As with heterozygosity, a higher number of alleles at
the candidate locus and a more balanced distribution of allele frequencies result in a
greater PIC value. It is generally considered that when PIC > 0.5, the genetic marker is
highly informative; when 0.25 < PIC < 0.5, the marker is moderately informative; and
when PIC < 0.25, the marker is only slightly informative. The formula for calculating the
PIC of a genetic marker is as follows:


$$\mathrm{PIC} = 1 - \sum_{i=1}^{n} p_i^2 - \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} 2 p_i^2 p_j^2 \tag{8.3}$$


$p_i$ and $p_j$ indicate the frequencies of the ith and jth alleles of a genetic marker, and n is
the number of alleles.
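
As a minimal sketch of Equation 8.3, the function below takes a list of allele frequencies for one marker. The function name is an assumption, and the two inputs are invented to show how PIC grows with the number of evenly distributed alleles.

```python
def pic(allele_freqs):
    """Polymorphism information content of one marker (Equation 8.3)."""
    sum_of_squares = sum(p * p for p in allele_freqs)
    cross_term = 0.0
    k = len(allele_freqs)
    # Sum over every unordered pair of distinct alleles (i < j).
    for i in range(k - 1):
        for j in range(i + 1, k):
            cross_term += 2 * allele_freqs[i] ** 2 * allele_freqs[j] ** 2
    return 1 - sum_of_squares - cross_term


pic_two_alleles = pic([0.5, 0.5])                  # 1 - 0.5 - 0.125 = 0.375
pic_four_alleles = pic([0.25, 0.25, 0.25, 0.25])   # 1 - 0.25 - 0.046875 = 0.703125
```

A biallelic marker therefore cannot exceed a PIC of 0.375, which is one reason multiallelic markers such as STRs reach the PIC > 0.5 range more easily.
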


Forensic application analysis based on population genetics 


Data from population genetics are the basis for forensic applications such as individual
identification, paternity testing, and ancestry inference.


Individual identification

Comparing the genotypes of molecular markers obtained from biological samples left at a
crime scene with a forensic DNA database is an important method for identifying
unknown individuals in forensic investigation practice. Because the number of genetic
markers used is limited, or the information they carry is insufficient, the comparison of a
sample genotype with database genotypes may produce false positive and false negative
results. Random match probability (RMP) and discrimination power (DP) are indicators
used to evaluate the reliability of an individual identification conclusion based on DNA
typing and to help judge matching results. For newly developed candidate loci or
detection systems, RMP and DP can be used to measure the efficacy of a genetic marker or
detection system for individual identification.

1. Random match probability. 

The random match probability is a commonly used parameter in population genetics
that expresses the probability that two individuals randomly selected from the population
have the same genotype or haplotype. In forensic identification, the RMP addresses the
situation in which the genotypes of two samples are the same but the crime-scene sample
was left by a random individual rather than by the suspect, and it gives the probability of
such a coincidental match. The smaller the RMP, the less likely it is that the crime-scene
sample was left by a random individual rather than the suspect, and the less likely it is
that the match between the crime-scene sample and the suspect sample is a random
event. A small RMP therefore supports the hypothesis that the two genotypes came from
the same person, that is, that the crime-scene sample was left by the suspect. Human
individual identification compares two samples across a series of genetic markers to
determine whether they are from the same individual.

For a genetic marker with multiple alleles on an autosome, the formula for calculating
the probability that two individuals randomly selected from a population have the
same genotype is as follows:


$$\mathrm{RMP} = \sum_{i=1}^{n} p_i^2 \tag{8.4}$$


$p_i$ indicates the frequency of the ith allele, and n indicates the number of alleles of the
genetic marker.

Combined RMP (CRMP) refers to the probability that two individuals randomly
selected from a population have the same genotyping results at multiple genetic markers.
CRMP can be calculated as follows:


$$\mathrm{CRMP} = \prod_{j=1}^{n} \mathrm{RMP}_j \tag{8.5}$$


$\mathrm{RMP}_j$ indicates the RMP of the jth genetic marker, and n indicates the number of
genetic markers.
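
Equations 8.4 and 8.5 amount to a sum of squared frequencies per marker followed by a product over markers. The sketch below is illustrative and the function names are assumptions. It applies the sum of squares to whatever frequency vector is supplied, as in Equation 8.4; many forensic references apply the same sum to genotype or haplotype frequencies, and the code works unchanged for such input.

```python
from math import prod


def random_match_probability(freqs):
    """Equation 8.4: sum of squared frequencies for one marker."""
    return sum(p * p for p in freqs)


def combined_rmp(freqs_per_marker):
    """Equation 8.5: product of single-marker RMP values (independent markers)."""
    return prod(random_match_probability(f) for f in freqs_per_marker)


# Two invented markers: one with two equally frequent classes, one with four.
markers = [[0.5, 0.5], [0.25, 0.25, 0.25, 0.25]]
crmp = combined_rmp(markers)   # 0.5 * 0.25 = 0.125
```

Each additional independent marker multiplies the CRMP by a factor below 1, which is why adding loci lowers the chance of a coincidental match.
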

2. Discrimination power.

Discrimination power is the counterpart of the random match probability and refers to
the probability that two individuals randomly selected from the population have
different genotypes. It is used to assess the ability of a polymorphic genetic marker to
discriminate between random unrelated individuals. The greater the number of alleles of
a polymorphic genetic marker and the more uniform the distribution of its alleles, the
higher the DP and the stronger the ability of the marker to distinguish unrelated
individuals. For a single genetic marker with multiple alleles on an autosome, the formula
for calculating DP is as follows:


$$\mathrm{DP} = 1 - \sum_{i=1}^{n} p_i^2 = 1 - \mathrm{RMP} \tag{8.6}$$


$p_i$ indicates the frequency of the ith allele, and n indicates the number of alleles of this
marker. When multiple genetic markers are used in combination, the discrimination
power is expressed as the combined DP (CDP), calculated as follows:

$$\mathrm{CDP} = 1 - \prod_{j=1}^{n} \mathrm{RMP}_j = 1 - \prod_{j=1}^{n} \left(1 - \mathrm{DP}_j\right) \tag{8.7}$$


$\mathrm{RMP}_j$ indicates the RMP of the jth marker, $\mathrm{DP}_j$ indicates the DP of the jth marker,
and n indicates the number of markers.
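
The sketch below follows Equations 8.6 and 8.7 and adds one practical point that is not in the text: with many highly polymorphic markers, the CDP becomes indistinguishable from 1 in floating-point arithmetic, so it can be more informative to report the combined RMP on a log scale. All function names here are assumptions made for illustration.

```python
from math import log10, prod


def discrimination_power(freqs):
    """Equation 8.6: one minus the sum of squared frequencies."""
    return 1 - sum(p * p for p in freqs)


def combined_dp(dps):
    """Equation 8.7: 1 - product of (1 - DP_j) over independent markers."""
    return 1 - prod(1 - dp for dp in dps)


def log10_combined_rmp(dps):
    """log10 of the product of (1 - DP_j), summed to avoid underflow to zero."""
    return sum(log10(1 - dp) for dp in dps)


cdp_small = combined_dp([0.5, 0.75])        # 1 - 0.5 * 0.25 = 0.875
cdp_panel = combined_dp([0.9] * 20)         # rounds to 1.0 in double precision
log_crmp_panel = log10_combined_rmp([0.9] * 20)   # about -20
```

For the invented 20-marker panel, the CDP is reported as 1.0, while the log-scale value still shows that the combined RMP is on the order of 10 to the power of -20.
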


Parentage testing 

Parentage testing uses the DNA typing results of genetic markers, analyzed according to
the laws of inheritance, to determine the parentage relationship between the alleged
father (AF) or the alleged mother (AM) and the offspring. Probability of exclusion (PE),
cumulative PE (CPE), mutation rate, PI, and combined PI (CPI) are important parameters
in parentage testing applications.

1. Probability of exclusion and cumulative probability of exclusion. 

The probability of exclusion refers to the probability that a man who is not the biological
father of the child can be excluded by a genetic detection system. This value is used to
assess the utility of a genetic detection system in paternity testing cases. The more
polymorphic the genetic markers in the detection system, the stronger its power to
exclude nonbiological fathers. Taking the commonly used multiallelic STR as an example,
let $p_i$ be the frequency of the ith allele of a marker, $p_j$ the frequency of the jth allele,
with the ith and jth alleles different, and n the number of alleles. The PE of this locus is
then:


$$\mathrm{PE} = \sum_{i=1}^{n} p_i (1 - p_i)^2 - \frac{1}{2} \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} p_i^2 p_j^2 \left(4 - 3p_i - 3p_j\right) \tag{8.8}$$


In practical applications, it is often necessary to combine multiple loci in the
identification system to determine the possibility of excluding paternity, that is, the
cumulative PE (CPE). When two or more genetic markers are independent of each other,
the CPE can be calculated with the product rule of probability. The formula for
calculating CPE is:

$$\mathrm{CPE} = 1 - (1 - \mathrm{PE}_1)(1 - \mathrm{PE}_2)(1 - \mathrm{PE}_3) \cdots (1 - \mathrm{PE}_n) = 1 - \prod_{i=1}^{n} (1 - \mathrm{PE}_i) \tag{8.9}$$

$\mathrm{PE}_i$ is the PE value of the ith marker, and n is the number of markers.
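
The single-locus and cumulative exclusion probabilities can be sketched as follows. The single-locus function sums over ordered pairs of distinct alleles and halves the result, so each unordered pair contributes once. The function names are assumptions made for illustration. As a sanity check, for a biallelic marker with frequencies p and q the single-locus value reduces to pq(1 - pq).

```python
from math import prod


def probability_of_exclusion(allele_freqs):
    """Single-locus PE for a codominant marker from its allele frequencies."""
    first = sum(p * (1 - p) ** 2 for p in allele_freqs)
    second = 0.0
    for i, p_i in enumerate(allele_freqs):
        for j, p_j in enumerate(allele_freqs):
            if i != j:  # the ith and jth alleles must be different
                second += p_i ** 2 * p_j ** 2 * (4 - 3 * p_i - 3 * p_j)
    return first - 0.5 * second


def cumulative_pe(pes):
    """Equation 8.9: 1 - product of (1 - PE_i) over independent markers."""
    return 1 - prod(1 - pe for pe in pes)


pe_biallelic = probability_of_exclusion([0.5, 0.5])    # 0.25 * (1 - 0.25) = 0.1875
pe_triallelic = probability_of_exclusion([1 / 3] * 3)  # 10/27, about 0.370
cpe_two_markers = cumulative_pe([pe_biallelic, pe_triallelic])
```

The step from two to three equally frequent alleles nearly doubles the single-locus PE, which illustrates why more polymorphic markers exclude nonbiological fathers more effectively.
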

2. Mutation rate.

DNA replication errors or faulty repair of DNA damage can lead to mutations. Mutations
that occur during meiosis are passed on to the next generation along with the genetic
material, producing genotypes in parent and offspring that do not conform to the laws of
inheritance. As the number of tested genetic markers increases, the chance of
encountering a mutation also increases. The high throughput of the NGS platform
increases the number of genetic markers that can be detected at once. Reasonably and
accurately assessing the mutation rates of the various genetic markers used in forensic
practice is therefore also an important purpose of population genetic research on the
NGS platform. The primary method for assessing the mutation rates of forensic genetic
markers is based on pedigree samples covering at least 500 meioses.

3. Paternity index and combined paternity index.

The PI is a parameter for judging the strength of genetic evidence in paternity testing.
It is a likelihood ratio (LR) of two conditional probabilities: the probability (X) of the
observed genotypes under hypothesis 1, that the alleged father (AF) is the biological
father of the child, divided by the probability (Y) of the observed genotypes under
hypothesis 2, that a random man is the biological father of the child. It is calculated as
follows:

$$\mathrm{PI} = \mathrm{LR} = \frac{X}{Y} \tag{8.10}$$


When multiple genetic markers are used for parentage determination and the markers are
independent of each other, the CPI can be obtained by multiplying the PI values of the
individual markers. The formula for calculating the CPI of n genetic markers is as follows:


$$\mathrm{CPI} = \mathrm{PI}_1 \times \mathrm{PI}_2 \times \mathrm{PI}_3 \times \cdots \times \mathrm{PI}_n \tag{8.11}$$


$\mathrm{PI}_n$ is the PI value of the nth marker.
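
Equations 8.10 and 8.11 can be illustrated with a small numerical sketch. The function names and all numbers are assumptions made for the example. The worked case uses a common simplification for a mother-child-alleged father trio: if the allele the child must have received from the father has population frequency p, then X = 0.5 when the alleged father is heterozygous for that allele and X = 1 when he is homozygous, while Y = p.

```python
from math import prod


def paternity_index(x, y):
    """Equation 8.10: likelihood ratio X / Y for one marker."""
    return x / y


def combined_paternity_index(pis):
    """Equation 8.11: product of per-marker PI values (independent markers)."""
    return prod(pis)


# Illustrative values only: obligate paternal allele frequency p = 0.1.
pi_heterozygous_af = paternity_index(0.5, 0.1)   # about 5
pi_homozygous_af = paternity_index(1.0, 0.1)     # about 10
cpi = combined_paternity_index([5.0, 10.0, 2.0])  # 100.0
```
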


Inference of ancestry information

When a database comparison fails to identify a suspect, inferring ancestry is one of the
most effective ways to narrow the scope of the investigation. The ancestry of an unknown
individual can be inferred by analyzing ancestry informative markers, whose gene
frequencies differ greatly among populations, in combination with multiple population
genetic parameters. A precise inference of the ancestral origin of biological material from
the scene can help focus the investigation and speed up the resolution of a case. The main
parameters involved in ancestry inference are the δ value and the $F_{ST}$ value.

1. δ value.

The δ value represents the absolute difference between the frequencies of the same
allele in two different regions or ethnic groups. The greater the difference in the allele
frequency of the locus between the two populations, the larger the δ value. The δ value is
one of the important reference indicators when screening ancestry informative loci.
When the δ value between two populations is larger than 0.5, the genetic marker is
defined as a population-specific marker for the two target populations and can be
considered an ideal marker for ancestry inference between them.
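
Screening by δ is a simple filter over candidate markers. The sketch below is illustrative: the function names, marker names, and frequencies are all invented, and the 0.5 cutoff is the one stated above.

```python
def delta_value(freq_pop1, freq_pop2):
    """Absolute difference in the frequency of the same allele in two populations."""
    return abs(freq_pop1 - freq_pop2)


def population_specific_markers(markers, threshold=0.5):
    """Names of markers whose delta value exceeds the threshold.

    `markers` maps a marker name to the frequencies of one allele in the
    two populations being compared.
    """
    return [
        name
        for name, (f1, f2) in markers.items()
        if delta_value(f1, f2) > threshold
    ]


# Hypothetical candidate markers with invented allele frequencies.
candidates = {
    "marker_a": (0.90, 0.20),
    "marker_b": (0.55, 0.45),
    "marker_c": (0.10, 0.85),
}
selected = population_specific_markers(candidates)   # ["marker_a", "marker_c"]
```
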

2. $F_{ST}$ value.

Because of differences in genetic structure and genetic background, the distributions
of alleles and genotypes can differ considerably between populations. In population
genetics analysis, the polymorphism parameters of one population cannot be generalized
to another population that differs substantially from it. Therefore, when analyzing
several populations, it is first necessary to understand whether the populations differ and
to what extent. The $F_{ST}$ value, also called the fixation index or the index of genetic
differentiation between populations, is an important parameter for measuring the degree
of differentiation and the genetic distance between populations.

Because molecular genetic markers differ in their geographical and population
distributions, the FST thresholds used to screen genetic markers differ between research
projects with different purposes. For example, for a genetic marker used for individual
identification, the FST value should be as small as possible. The smaller the FST value,
the smaller the difference in gene frequency of the genetic marker between populations,
and the more widely it can be used for individual identification across multiple
populations. For a genetic marker used for ancestry-informative inference, higher FST
values between populations are desirable: the larger the value, the greater the difference
in gene frequency of the genetic marker between populations, and the better it can
distinguish them. Depending on the target populations, the types and numbers of genetic
markers, and the application purposes, researchers have proposed different FST cutoff
reference values.
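
The relationship between allele-frequency differences and FST can be illustrated with a short sketch. It assumes the classical heterozygosity-based definition FST = (HT - HS) / HT with equally weighted populations. This is a simplification: published studies typically use sample-size-corrected estimators, and the function names here are hypothetical.

```python
from typing import Sequence


def expected_heterozygosity(freqs: Sequence[float]) -> float:
    """Expected heterozygosity 1 - sum(p_i^2) for one population."""
    return 1.0 - sum(p * p for p in freqs)


def fst(pop_allele_freqs: Sequence[Sequence[float]]) -> float:
    """FST = (HT - HS) / HT for one locus, with equal population weights.

    pop_allele_freqs holds one allele-frequency vector per population,
    with alleles listed in the same order in every vector.
    """
    n_pops = len(pop_allele_freqs)
    n_alleles = len(pop_allele_freqs[0])
    mean_freqs = [sum(pop[i] for pop in pop_allele_freqs) / n_pops
                  for i in range(n_alleles)]
    h_total = expected_heterozygosity(mean_freqs)
    h_within = sum(expected_heterozygosity(pop) for pop in pop_allele_freqs) / n_pops
    if h_total == 0.0:
        return 0.0  # locus is monomorphic in the pooled population
    return (h_total - h_within) / h_total


if __name__ == "__main__":
    # Similar frequencies: low FST, better suited to individual identification.
    print(fst([[0.5, 0.5], [0.55, 0.45]]))
    # Divergent frequencies: high FST, better suited to ancestry inference.
    print(fst([[0.9, 0.1], [0.1, 0.9]]))
```

The two example calls mirror the two screening goals in the text: a marker with nearly equal frequencies gives an FST close to zero, whereas a marker with strongly divergent frequencies gives a large FST.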


Commonly used markers of population study on NGS platform 


At present, DNA genetic markers in forensic population genetic research based on the 
different NGS platforms include length polymorphism genetic markers (e.g., STR), 
sequence polymorphism genetic markers (e.g., SNP), mitochondrial DNA, MH genetic 
markers, and compound genetic markers (e.g., InDel-STR, InDel-SNP, SNP-STR). 
These markers are mainly used for forensic individual identification, paternity testing, 
ancestry-informative inference, etc. 


Short tandem repeat 


Currently, STR is the most widely used genetic marker in forensic genetics, and it is 
also one of the important genetic markers for population genetic research based on 
NGS platform. Increasing the number of STR loci is an effective strategy to decrease 
the probability of random matching. However, due to the limitations of fluorescence 
types and the length of amplified fragments, the traditional CE platform has a limited 
number of detectable loci. The high-throughput NGS platform is not limited by the 
number of loci. Also, because the detection amplicon of the NGS platform is smaller, 
it is more suitable for trace and degraded samples. In addition, the NGS platform can 
obtain the sequence information of the core region and flanking region in addition 
to the detection of the repeat number of the core region of the STR, so that the 
sequence differences in these regions can be found and the polymorphism of the locus 
can be further improved. Gettings et al. sequenced 22 autosomal STR loci in 183 unrelated
African Americans, Caucasians, and Hispanics on an NGS platform and found that 6 loci
(D12S391, D2S1338, D21S11, D8S1179, vWA, and D3S1358) had more than twice as many
sequence-based alleles as length-based alleles (Gettings et al., 2016). The CODIS STR loci
are widely used in in-house systems and commercial kits because of their high
polymorphism. According to literature reports, loci with more complex core regions, such
as D3S1358, D5S818, D7S820, D8S1179, D13S317, D18S51, D21S11, and vWA, show more
variation in different populations, and D21S11 showed the largest increase in the number
of sequence-based alleles compared with length-based alleles across many populations
(Gettings et al., 2016; Kwon et al., 2016; Novroski et al., 2016; Warshauer et al.,
2015; Zhao et al., 2015). Wang et al. used the SeqTypeR24 CASE kit to sequence
and analyze 291 Beijing Han individuals and found a total of 234 length-based alleles
and 356 sequence-based alleles, together with 22 new sequence types in the core repeat
regions of 9 STRs (D12S391, D1S1679, D21S11, D2S1338, D3S1358, D5S818, D6S474, FGA,
vWA) (Chen et al., 2021). In addition, researchers have applied other commercial or
in-house NGS-STR systems and shown that sequence-based STR typing on the NGS platform
is more polymorphic and can therefore achieve higher individual identification
accuracy (Barrio et al., 2019; Chen et al., 2021; Dai et al.,
2019; Dash et al., 2021; Houston et al., 2018; Huszar et al., 2018; Kim et al., 2017; 
Lee et al., 2021; Liu et al., 2020; Miao et al., 2021; Silva et al., 2020; Wang et al., 
2020; Warshauer et al., 2015; Wu et al., 2020; Zhang et al., 2018; Zhao et al., 2015, 
2016). The high-throughput NGS platform allows a multiplex amplification system
to contain more genetic markers and different types of markers. The ForenSeq
DNA Signature Prep Kit is a typical multi-marker detection system that
can serve several forensic applications simultaneously. A number of studies based
on the ForenSeq DNA Signature Prep Kit in Chinese Han (Li et al., 2020), Chinese
Tibetan (Li et al., 2021; Peng et al., 2020), Chinese Li (Fan et al., 2021), Chinese
Hui (He et al., 2021), French (Delest et al., 2020), White British, British Chinese,
South Asian, West African, and North East African (Devesse et al., 2020), Danish
(Hussing et al., 2019), Saudi Arabian (Khubrani et al., 2019), US Caucasian, Hispanic,
African American, and East Asian Chinese (Novroski et al., 2016), Yavapai Native
American (Wendt et al., 2017), Korean (Kim et al., 2018), and other populations
on 26 A-STRs, 24 Y-STRs, and 6 X-STRs have shown that the heterozygosity,
DP, PIC, and RMP values of sequence-based STR typing were larger than those of
length-based STR typing. The cumulative RMP of the 26 A-STR loci based on sequence
polymorphisms is more than 700 times higher than that based on length alone (Devesse
et al., 2020), which greatly enhances the polymorphic information content of STR loci
and improves the efficiency of forensic applications such as individual identification
and paternity testing. In addition, ancestry-informative inference is another application
of STR genetic markers. Based on the NGS platform, Dash et al. conducted
a study with 31 STRs on a central Indian population, showing that the central Indian
population has relatively close genetic relationships with other Indian populations.
Based on the genotyping results of these 31 STRs, they also observed that the other
reference populations clustered according to geographical distance, and the
genetic relationships between populations showed a strong correlation with geographical
location (Dash et al., 2021).
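
To make the gain from sequence-based alleles concrete, the sketch below computes expected heterozygosity, match probability, discrimination power (DP), and PIC for a single locus from its allele frequencies, assuming Hardy-Weinberg equilibrium. The frequencies are invented for illustration, and the function name is hypothetical. Splitting one length-based allele into two sequence-based alleles raises heterozygosity, DP, and PIC.

```python
from itertools import combinations
from typing import Dict, Sequence


def forensic_parameters(freqs: Sequence[float]) -> Dict[str, float]:
    """Single-locus forensic parameters under Hardy-Weinberg equilibrium."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    homozygosity = sum(p * p for p in freqs)
    # Match probability: sum of squared genotype frequencies.
    match_prob = (sum((p * p) ** 2 for p in freqs)
                  + sum((2 * p * q) ** 2 for p, q in combinations(freqs, 2)))
    # Polymorphism information content.
    pic = (1.0 - homozygosity
           - sum(2 * (p * p) * (q * q) for p, q in combinations(freqs, 2)))
    return {
        "heterozygosity": 1.0 - homozygosity,
        "match_probability": match_prob,
        "discrimination_power": 1.0 - match_prob,
        "pic": pic,
    }


if __name__ == "__main__":
    length_based = [0.5, 0.5]           # two alleles distinguished by length
    sequence_based = [0.5, 0.25, 0.25]  # second allele split by sequence
    print(forensic_parameters(length_based))
    print(forensic_parameters(sequence_based))
```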


Single nucleotide polymorphism 


SNPs are single-base polymorphisms in DNA sequences, such as substitutions, inser- 
tions, and deletions. SNP genetic markers are widely distributed in the human genome, 


Table 8.1 STR studies on NGS platforms. 


No. 


Machines 


Ion torrent 
PGM 
MiSeq 


Ion torrent 
PGM 

MiSeq 

MiSeq 
FGx 


Ion torrent 


S5 


Kits 


Custom primers 


Custom primers 


PowerSeq Auto 
System 


A simplified 
PCR-based 
library 
preparation 
method 

ForenSeq DNA 
Signature prep 
Kit 


Custom primers 


Custom primers 

ForenSeq DNA 
Signature prep 
Kit 


Custom primers 


Number 
of 
markers 


Details of 
markers 


28 Y-STRs 


13 Y-STRs 


22 A-STRs, 1 Y- 
STR, and 
Amelogenin 


23 Y-STRs 


27 A-STRs, 7 X- 
STRs and 24 
Y- STRs 


10 A-STRs 


23 A-STRs 

27 A-STRs, 7 X- 
STRs, and 25 
Y- STRs, 
Amelogenin, 
and 94 iiSNPs, 
56 aiSNPs, 
and 22 piSNPs 

12 A-STRs 


Populations 


African 
American, 
Caucasian, and 
Hispanic 

Chinese Han 
population 

African 
American, 
Caucasian, and 
Hispanic 

Korean 


US Caucasian, 
Hispanic, 
African 
American, and 
East Asian 
Chinese 

Chinese Han 


Korean 


Yavapai Native 
American 


Cannabis sativa 


Sample 
numbers 


References 

Warshauer et al. 
(2015) 

Zhao et al. (2015) 

Gettings et al. 


(2016) 


Kwon et al. (2016) 


Novroski et al. 
(2016) 


Zhao et al. (2016) 


Kim et al. (2017) 
Wendt et al. 
(2017) 


Houston et al. 
(2018) 


Years 


2015 


2015 


2016 


2016 


2016 


2016 


2017 
2017 


2018 


Table 8.1 STR studies on NGS platforms (continued).

No. | Machines | Kits | Number of markers | Details of markers | Populations | Sample numbers | References | Years
10 | MiSeq FGx | PowerSeq Auto/Mito/Y System kit | 29 | 23 Y-STRs | Multi-groups | 100 | Huszar et al. (2018) | 2018
11 | MiSeq FGx | MiSeq FGx forensic signature kit | 27 | 27 A-STRs | Korean | 209 | Kim et al. (2018) | 2018
12 | Ion Torrent PGM | Custom primers | | 34 A-STRs | Chinese Han | 200 | Zhang et al. (2018) | 2018
13 | Ion S5 | Precision ID GlobalFiler NGS STR Panel v2 | | 31 A-STRs | Spanish individuals | 496 | Barrio et al. (2019) | 2019
14 | Ion Torrent PGM | SeqType R16 | | 16 A-STRs | Chinese Han | 593 | Dai et al. (2019) | 2019
15 | MiSeq FGx | ForenSeq DNA Signature Prep Kit | | 26 A-STRs, 24 Y-STRs, and 6 X-STRs | Danish | 363 | Hussing et al. (2019) | 2019
16 | MiSeq FGx | ForenSeq DNA Signature Prep Kit | 118 | 27 A-STRs and 91 iSNPs | Saudi Arabian | 89 | Khubrani et al. (2019) | 2019
17 | MiSeq FGx | ForenSeq DNA Signature Prep Kit | 27 | 27 A-STRs | Multi-groups | 350 | Aalbers et al. (2020) | 2020
18 | MiSeq FGx | ForenSeq DNA Signature Prep Kit | 154 | 27 A-STRs, 7 X-STRs, 25 Y-STRs, Amelogenin, and 94 iiSNPs | French | 169 | Delest et al. (2020) | 2020
19 | MiSeq FGx | ForenSeq DNA Signature Prep Kit | 26 | 26 A-STRs | White British, British Chinese, South Asian, West African, and North East African | 1007 | Devesse et al. (2020) | 2020
20 | MiSeq FGx | ForenSeq DNA Signature Prep Kit | | 26 A-STRs, 24 Y-STRs, and 6 X-STRs | Chinese Han | 750 | Li et al. (2020) | 2020
21 | MiSeq FGx | Custom primers | | 42 A-STRs | Chinese Hebei Han | 58 | Liu et al. (2020) | 2020
22 | MiSeq FGx | ForenSeq DNA Signature Prep Kit | 154 | 27 A-STRs, 7 X-STRs, 25 Y-STRs, Amelogenin, and 94 iiSNPs | Chinese Tibetan | 107 | Peng et al. (2020) | 2020
23 | MiSeq | PowerSeq Auto/Y system prototype kit | | 23 Y-STRs + SE33 | US population | 786 | Silva et al. (2020) | 2020
24 | Ion S5 XL | Precision ID GlobalFiler NGS STR Panel v2 | | 30 A-STRs | Chinese Tibetan and Han | 115 | Wang et al. (2020) | 2020
25 | MiSeq FGx | Custom primers | 29 | 28 A-STR loci + Amelogenin | Chinese Yunnan Bai | 203 | Wu et al. (2020) | 2020
26 | Ion Torrent PGM | SeqTypeR24 CASE kit | | 23 A-STRs | Chinese Beijing Han | 291 | Chen et al. (2021) | 2021
27 | Ion S5 | Precision ID GlobalFiler NGS STR Panel v2 | | 31 A-STRs | Central Indian | 138 | Dash et al. (2021) | 2021
28 | MiSeq FGx | ForenSeq DNA Signature Prep Kit | 154 | 27 A-STRs, 7 X-STRs, 25 Y-STRs, Amelogenin, and 94 iiSNPs | Chinese Hainan Hlai | 136 | Fan et al. (2021) | 2021
29 | MGISEQ-2000 | MPS-based Forensic Analysis System Multiplecues SetB Kit | 134 | 52 STRs, 81 Y-STRs, and one Y-InDel (M175) | Chinese Han | 413 | Fan et al. (2022) | 2021

Table 8.1 STR studies on NGS platforms (continued).


No. 


30 


31 


32 


33 


34 


35 


36 


Machines 


MiSeq 
FGx 
MiSeq 
FGx 


BGISEQ- 
500 
MiSeq 
FGx 
MiSeq 
FGx 
Database 


Number 
of 
Kits markers 


Custom primers 


ForenSeq DNA 
Signature prep 
Kit 

Custom primers 


SifaMPS Panel 
Custom primers 


Custom primers 


ForenSeq DNA 
Signature prep 
Kit 


Details of 
markers 


23 A-STRs 


26 A-STRs, 24 
Y-STRs, and 
6 X-STRs 

186 SNPs + 123 
STRs 

87 STRs and 294 
SNPs 

15 A-STRs 


12 X-STRs 


27 A-STRs + 24 
Y-STRs + 7 
X-STRs 


Populations 


Korean 


Chinese Tibetan 


Chinese Han 

Chinese Han 

Chinese 
Northern Han 


Filipino male 
individuals 


Chinese Hui 


Sample 
numbers 


Years 


2021 


References 


Lee et al. (2021) 


Li et al. (2021) 2021 


Miao et al. (2021) | 2021 


Tao et al. (2021) 2021 


2018 


Zhang et al. (2018) 


Salvador et 2018 
al.(2018) 

Chen et al. 

(2021) 

2021 




so there are abundant loci to meet forensic needs. Compared with STRs, SNPs
have a lower mutation rate and higher stability in the population, which makes them
suitable for ancestry-informative inference. However, as biallelic genetic markers, SNPs
are less polymorphic per locus than STRs and thus require the combined
detection of a larger number of loci to achieve comparable accuracy in individual
identification. NGS offers high throughput and accurate detection, so it can genotype
a large number of SNPs simultaneously and support SNP typing for different purposes,
which greatly improves the forensic efficiency of SNPs in practice and expands their
application prospects. According to the opinions of the 22nd International Conference
on Forensic Genetics, the forensic applications of SNPs are currently divided mainly
into individual identification SNPs, ancestry-informative SNPs (AISNPs),
lineage-informative SNPs, and phenotype-informative SNPs (PISNPs). However, regardless
of the type of SNP marker, its expected forensic efficacy has limited applicability in
populations other than those it was designed for. For example, AISNPs that are effective
for some populations may be ineffective in identifying other populations, so different
AISNPs need to be screened for different populations. Zhang et al. (2017a,b) sequenced
273 and 231 SNP loci on the Ion Torrent PGM sequencing platform and showed that the
platform and the SNP markers are of great value in individual identification and complex
kinship identification, respectively. In recent years, more and more NGS-SNP systems have
been developed for individual identification and biogeographic ancestry-informative
inference. Li et al. (2017) screened 175 SNPs at the genome-wide level
and evaluated the individual identification performance of the system in 54 reference
populations worldwide. The results indicated that the system had good individual
identification efficiency in these populations, with an average RMP ranging from
4.77 × 10^-71 to 1.06 × 10^-64. Guo et al. (2020) sequenced and analyzed 30 AISNPs
in 127 Han individuals from Shaanxi Province, China, on the Illumina NextSeq 500
platform. The target group was combined with 22 reference populations from the 1000
Genomes Project (Phase 3) to analyze the genetic background of the Chinese Shaanxi Han
using population genetic methods such as STRUCTURE, FST, and principal component
analysis (PCA). Their results showed that the Chinese Shaanxi Han population shares more
similar ancestral components with East Asian populations. Bulbul et al. (2018) analyzed
86 AISNP loci in 2568 individuals from 39 groups spanning northeastern Africa,
southwestern Asia, and Europe (EUR) to the Ural Mountains and found that these 86 AISNPs
can effectively evaluate population genetic differences in the target regions and might
be used for ancestry-informative inference.

Based on the Ion Torrent PGM sequencing platform, Thermo Fisher has developed
two commercial kits that have received widespread attention from researchers: the
Precision ID Identity Panel for individual identification and the Precision ID Ancestry
Panel for biogeographic ancestry-informative inference. The Precision ID Identity
Panel consists of 90 autosomal SNPs and 34 Y-SNPs, while the Precision ID Ancestry
Panel consists of 165 autosomal AISNPs. Both kits can be run with automated
library construction and template preparation workflows for easy operation. In
addition, the small amplicons designed in the kits make them more suitable for
the analysis of degraded forensic samples. Studies have reported that the
Precision ID Identity Panel has great potential for forensic individual identification in
Chinese Tibetans and Hui (Liu et al., 2018), northern and southern Han populations
(Guo et al., 2016; Li et al., 2018), Brazilians (Avila et al., 2019), and Indians (Dash et al.,
2022). Liu et al. (2018) found that the cumulative matching probability (CMP) values of
124 SNPs in Tibetan and Hui populations in China were 2.5880 x 10> and 
4.6326 x 10-**, respectively. And their CPE values were 0.999999386152271 and 
0.999999696360182, respectively. A population genetic analysis based on the same
124 SNPs found that the Tibetan and Hui groups had close genetic relationships
with East Asian populations. Similarly, studies have explored biogeographic
ancestry-informative inference with the Precision ID Ancestry Panel in
Danes, Somalis (Pereira et al., 2017), Chinese Kazaks (Xie et al., 2020), Chinese Han,
Chinese Li, Chinese Gelao (He et al., 2021), and seven other Asian populations
(Lee et al., 2018). Xie et al. (2020) sequenced and analyzed 165 AISNP genotypes of 96
Chinese Kazak individuals and found that the Chinese Kazak group has mixed East Asian
and European ancestral components, with a ratio of about 62:37. Moreover,
SNP detection systems designed by many research groups also perform well in
individual identification and ancestry-informative inference. For example, a system
containing 1245 SNPs was constructed by Wu et al. (2019); a 165 Y-SNP system was
constructed by Wang et al. (2019); a system of 144 SNPs and 20 MHs was constructed by
Phillips et al. (2019); a 153-SNP system was constructed by de la Puente et al.
(2021); and a 448-plex SNP system was constructed by Zhao et al. (2021). In addition,
the phenotypic characteristics of individuals can also be predicted by analyzing
SNP genotypes. Salvoro et al. (2019) evaluated the performance of four models (IrisPlex,
Ruiz, Allwood, and Hart) in predicting the eye color of 296 Italians from 24
SNPs. Bulbul and Filoglu (2018) analyzed the ancestral components
and phenotypic characteristics of populations in Southwest Asia, EUR, and
other continents through 156 SNPs, and their results showed that the system could be
used for ancestry-informative inference and prediction of the eye, hair, and skin colors
of some regional populations.
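
The cumulative matching probability (CMP) reported for SNP panels can be sketched as the product of per-locus match probabilities. The example below assumes independent biallelic loci in Hardy-Weinberg equilibrium and works in log10 space, because the product for more than a hundred SNPs becomes extremely small. The allele frequencies and function names are hypothetical.

```python
import math
from typing import Sequence


def snp_match_probability(p: float) -> float:
    """Match probability of one biallelic SNP with allele frequency p (HWE assumed)."""
    q = 1.0 - p
    # Sum of squared genotype frequencies: (p^2)^2 + (2pq)^2 + (q^2)^2.
    return (p * p) ** 2 + (2 * p * q) ** 2 + (q * q) ** 2


def log10_cumulative_match_probability(freqs: Sequence[float]) -> float:
    """log10 of the product of per-locus match probabilities (independent loci)."""
    return sum(math.log10(snp_match_probability(p)) for p in freqs)


if __name__ == "__main__":
    # Illustrative panel: 124 SNPs, all with allele frequency 0.5.
    panel = [0.5] * 124
    print(f"log10(CMP) = {log10_cumulative_match_probability(panel):.2f}")
```

A frequency of 0.5 gives the lowest match probability a single biallelic SNP can reach (0.375), which is why panels of this kind need many loci to approach the discrimination power of STR multiplexes.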

The development of NGS technology provides an accurate and reliable method for
in-depth study of the application of SNPs in forensic individual identification,
biogeographic ancestry-informative inference, and phenotype prediction. With the
continuous innovation of NGS technology and the discovery of more highly discriminating
SNPs, future forensic identification will be more accurate and efficient. Table 8.2
summarizes SNP studies on NGS platforms.


Table 8.2 SNP studies on NGS platforms.

No. | Machines | Kits | Number of markers | Details of markers | Populations | Sample numbers | References | Years
1 | Ion Torrent PGM | Custom primers | | | Chinese Han population | | Zhang et al. (2017b) | 2017
2 | Ion Torrent PGM | Custom primers | | | Chinese Han | | Zhang et al. (2017a) | 2017
3 | Ion Torrent PGM | Custom primers | | | Chinese | | Li et al. (2017) | 2017
4 | Illumina NextSeq | Custom primers | | | Chinese Han | | Guo et al. (2020) | 2020
5 | MiSeq | Custom primers | | | 39 reference populations | | Bulbul et al. (2018) | 2018
6 | Ion Torrent PGM | Precision ID Identity Panel | | | Tibetan, Uygur, and Hui minority ethnicities of China | | Liu et al. (2018) | 2018
7 | Ion Torrent PGM | HID-Ion AmpliSeq Identity Panel | | | Southern and Northern Chinese Han | | Guo et al. (2016) | 2016
8 | Ion Torrent PGM | HID-Ion AmpliSeq Identity Panel | | | Southern Chinese | | Li et al. (2018) | 2018
9 | Ion Torrent PGM | HID-Ion AmpliSeq Identity Panel | | | Brazilian regional populations | | Avila et al. (2019) | 2019
10 | Ion Torrent PGM | Precision ID Identity Panel | | | Indian | | Dash et al. (2022) | 2021
11 | Ion Torrent PGM | Precision ID Ancestry Panel | 165 | 165 SNPs | Danes and Somalis | 240 | Pereira et al. (2017) | 2017
12 | Ion Torrent PGM | Precision ID Ancestry Panel | 165 | 165 SNPs | Chinese Kazak | 96 | Xie et al. (2020) | 2020
13 | Ion Torrent PGM | Precision ID Ancestry Panel | 165 | 165 SNPs | Haikou Han, Qiongzhong Hlai, and Daozhen Gelao of China | 316 | He et al. (2021) | 2021
14 | Ion S5 XL | HID-Ion AmpliSeq Ancestry Panel | 165 | 165 SNPs | Southern China, Beijing, Japan, Korea, Vietnam, Nepal, India, and Pakistan | 750 | Lee et al. (2018) | 2018
15 | Illumina MiSeq and HiSeq | Custom primers | 1245 | 1245 SNPs | Chinese | 210 | Wu et al. (2019) | 2019
16 | Ion Torrent PGM | Custom primers | 165 | 165 Y-SNPs | Unknown | 53 | Wang et al. (2019) | 2019
17 | Ion Torrent PGM | Custom primers | 164 | 144 SNPs + 20 microhaplotypes | 31 populations from 1KG, SGDP, and MAPlex genotyping | 2333 | Phillips et al. (2019) | 2019
18 | MiSeq FGx | Custom primers | 153 | 153 SNPs | Individuals from Middle East, North, and East African regions | 167 | de la Puente et al. (2021) | 2021
19 | BGISEQ-500RS | Custom primers | 448 | 448-plex SNP | Chinese Han | 142 | Zhao et al. (2021) | 2021
20 | Ion Torrent PGM | Custom primers | 156 | 156 SNPs | Turkey, Turkish Cypriots, Druze, Somali, Zaramo, Afghanistan, Spain | 11 | Bulbul and Filoglu (2018) | 2018
21 | Ion Torrent PGM | Custom NGS-based panel | 24 | 24 SNPs | Italian | 296 | Salvoro et al. (2019) | 2019
22 | Ion Torrent PGM | Precision ID Ancestry Panel | 165 | 165 SNPs | Unknown | 95 | Al-Asfi et al. (2018) | 2018
23 | MiSeq FGx | ForenSeq DNA Signature Prep Kit | 165 | 165 SNPs | African American, East Asian, US Caucasian, Southwest US Hispanic | 714 | King et al. (2018) | 2018


Mitochondrial DNA


MtDNA is maternally inherited extranuclear genetic material. The mtDNA molecule
is a circular double-stranded DNA with a length of 16,569 bp, which can be divided into
two parts: the coding region and the noncoding region (also called the control region).
The coding region is relatively conserved and contains a total of 37 genes encoding
various RNAs and proteins. The control region (CR) of mtDNA is 1125 bp long and has
a relatively high mutation rate. The control region is usually divided into three
hypervariable regions: hypervariable region I (HVI), hypervariable region II (HVII),
and hypervariable region III (HVIII). Owing to its multiple copies, high mutation rate,
lack of recombination, and maternal inheritance, mtDNA plays an important role in the
identification of forensic matrilineal kinship, the identification of trace and degraded
samples, and population genetic studies. Using an NGS platform (Roche 454), Mikkelsen
et al. (2014) obtained point mutation genotypes of mtDNA HVI and HVII that agreed with
those previously detected by traditional CE Sanger sequencing, single base extension
(SNaPshot), and matrix-assisted laser desorption ionization time-of-flight mass
spectrometry (MALDI-TOF). Subsequently, a number of large-scale population studies
based on NGS platforms such as the Personal Genome Machine (PGM), Ion S5, and Illumina
MiSeq FGx confirmed that NGS can obtain highly reproducible and accurate genotyping
of full mtDNA sequences (Chaitanya et al., 2015; Kim et al., 2018; Lopopolo et al., 2016;
Park et al., 2017; Peck et al., 2016; Zhou et al., 2016). Accurate NGS-mtDNA typing of
highly degraded samples in multiple studies, such as skeletal remains aged 50 to 70 years
(Avila et al., 2022; Gorden et al., 2018), skeletal remains from the historical shipwreck
of La Belle (Ambers et al., 2020), and skeletal remains up to 8000 years old (Eduardoff
et al., 2017), confirmed the suitability of mtDNA for degraded forensic samples. In
addition, a large number of NGS-mtDNA-based population genetic studies of mother-child
pairs and unrelated individuals have explored the mutation rate, heterogeneity, maternal
inheritance pattern, and genetic relationships. The results have demonstrated that mtDNA
exhibits a high degree of polymorphism across diverse populations, making it a promising
tool for individual identification and tracing matrilineal pedigrees (Ma et al., 2018;
Park et al., 2017).
Furthermore, a series of population studies have been carried out based on mtDNA
hypervariable regions, control regions, and the whole mtDNA genome to explore the genetic
structures and evolutionary origins of different populations across continents from the
perspective of maternal genetic material, providing support for forensic
ancestry-informative inference. Among them, an NGS-mtDNA population study in
Greenland showed that the Inuit of Greenland were descended from the Thule, and factors
such as geographic isolation and genetic drift had a significant impact on the genetic
diversity of Greenlanders (Lopopolo et al., 2016). A Korean population analysis using the
whole mtDNA sequence on the NGS platform showed that Koreans have closer genetic
relationships with East Asian populations than with individuals from other continental
populations, suggesting that Koreans share an ancestral origin with East Asians
(Park et al., 2017). A Sherpa-specific subhaplogroup, A15c1, was
found in an NGS-mtDNA population genetic study of the Chinese Muli Tibetan, suggesting
that the Sherpa people may have originated from the Tibetan group, while the
Chinese Muli Tibetan group shares mitochondrial haplogroups with the Han population,
possibly because of intermarriage (Wang et al., 2020). An
NGS-mtDNA study of southern Brazilian populations showed that some populations in
southern Brazil are genetically more distant from other Brazilian populations and more
closely related to European populations (Avila et al., 2022). In addition, EMPOP is a
widely used forensic mtDNA database (https://empop.online). So far, EMPOP has
collected 48,572 high-quality mtDNA mitotypes, of which 46,963 cover HVS-I and
HVS-II, 38,361 cover the control region, and 4289 are whole mtDNA mitotypes.
Among the data collected by EMPOP, 19,605 mitotypes are from America, 13,721 are
from EUR, 12,196 are from Asia, 2577 are from Africa (AFR), and 473 are from
Australian and Oceanian populations. Table 8.3 summarizes forensic NGS-mtDNA
population studies, including sequenced regions, study populations, population sizes,
NGS platforms, and references.


Microhaplotype 


A MH isa kind of genetic marker consisting of at least two single nucleotide polymor- 
phisms (SNPs) within a certain range (300 bp) of DNA fragments. MH has several char- 
acteristics of no stutter peaks, abundant polymorphisms, and small amplicons, and also has 
the potential to be applied in forensic individual identification, mixture deconvolution, 
ancestral informative inference, and kinship testing. Because of these advantages, MH is 
an important type of genetic marker for population genetic research based on the NGS 
platform. Van der Gaag et al. (2018) developed a multiple amplification system contain- 
ing 16 MHs (at least 4 SNPs in the range of 70 bp), which can accurately classify the 276 
studied unrelated individuals into three great intercontinental origins namely EUR, 
AFR, and East Asia (EAS); additionally, this system confirmed the effectiveness of 
MHs in mixture deconvolution. In 2018, Peng et al. developed a multiple amplification 
system containing 25 MHs (amplicons less than 50 bp) based on the HiSeq X-10 platform 
(Chen et al., 2018). The results of population genetic study on unrelated individuals and 
artificially mixed samples of Chinese Han nationality showed that the sequencing depth 
of MHs has a significant correlation with the mixture ratio, and has the potential to 
deconvolve mixtures (Chen et al., 2018). Subsequently, a number of scientists based 
on MiSeq, Ion S5, and NextSeq 500 sequencing platforms developed a series systems con- 
taining 15 (Wang et al., 2022), 20 (Kureshi et al., 2020), 21 (Zou et al., 2022), 25 (Chen 
et al., 2019), 26 (Zhao et al., 2022), 30 (Wen et al., 2021), 40 (Yang et al., 2022), 59 
(Kureshi et al., 2020), 90 (Pakstis et al., 2021), 140 (de la Puente et al., 2017), and 124 
(Pang et al., 2020) MH loci. Based on the above systems, the researchers applied intercon- 
tinental populations (from AFR, EUR, EAS, South Asia, and the Americas), artificially 
simulated mixed samples, or degraded samples to confirm the effectiveness of individual 
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Table 8.3 Studies of mtDNA on NGS platforms. 


No. 


BONE 


OANA VI 


10 


11 
12 


13 


14 
15 
16 
17 
18 


19 
20 
No. | Machines | Regions | Populations | Sizes | Years | References
1 | Roche 454 GS Junior | HV1 and HV2 | Somali and Danish | | 2014 | Mikkelsen et al. (2014)
2 | Ion Torrent PGM | Whole mtDNA | European | | 2015 | Chaitanya et al. (2015)
3 | MiSeq | Whole mtDNA | Greenlandic | | 2016 | Lopopolo et al. (2016)
4 | MiSeq | Whole mtDNA | African American, U.S. Caucasian, and U.S. Hispanic | | 2016 | Peck et al. (2016)
5 | MiSeq | Whole mtDNA | Estonian | | 2016 | Stoljarova et al. (2016)
6 | Ion Torrent PGM | Whole mtDNA | Chinese Northern Han | | 2016 | Zhou et al. (2016)
7 | Ion Torrent PGM | mtDNA control region | DNA up to 8 kyrs of age | | 2017 | Eduardoff et al. (2017)
8 | Ion Torrent PGM | Whole mtDNA | Korean | | 2017 | Park et al. (2017)
9 | MiSeq FGx | Whole mtDNA | 21 nonprobative, degraded skeletal samples (aged 50-70 years) | | 2018 | Gorden et al. (2018)
10 | MiSeq | Whole mtDNA | Mother-child pair of European descent | | 2018 | Holland et al. (2018)
11 | Ion Torrent PGM | Whole mtDNA | Known innate heteroplasmy | | 2018 | Kim et al. (2018)
12 | Ion Torrent PGM | Whole mtDNA | Mother-child pairs of European descent | | 2018 | Ma et al. (2018)
13 | MiSeq FGx | Whole mtDNA | African American, US Caucasian, and US Hispanic | | 2018 | Peck et al. (2016)
14 | Ion Torrent PGM | Whole mtDNA | Brazilian | | 2019 | Avila et al. (2019)
15 | Ion S5 | Whole mtDNA | 24 populations | | 2019 | Strobl et al. (2019)
16 | Ion S5 | Whole mtDNA | Shipwreck of La Belle | | 2020 | Ambers et al. (2020)
17 | Ion Torrent PGM | Whole mtDNA | Romanian | | 2020 | Melchionda et al. (2020)
18 | MiSeq FGx | Whole mtDNA | Five US metapopulations: African American, Caucasian, Hispanic, Asian American, and Native American | | 2020 | Taylor et al. (2020)
19 | Ion S5 | Whole mtDNA | Chinese Tibetan | | 2020 | Wang et al. (2020)
20 | NextSeq 2000 | Whole mtDNA | Southern Brazilian population | | 2021 | Avila et al. (2022)
21 | Ion S5 | Whole mtDNA | 52 age-old skeletal remains from Vietnam | | 2021 | Ta et al. (2021)
22 | MiSeq FGx | Whole mtDNA | African American, Caucasian, and US Hispanic | | 2021 | Wisner et al. (2021)

identification, ancestry informative inference, degraded material identification, and mixture deconvolution, especially for mixtures with large differences in contributor proportions. The potential of MH markers for second-degree kinship analysis has also been confirmed in familial population studies using NGS-based systems containing 13 (Zhu et al., 2019), 54, 59 (Wu et al., 2021), and 140 (de la Puente et al., 2017) MH markers (Sun et al., 2020; Wu et al., 2021). In addition, Chen et al. (2019) analyzed 10 MHs in the 1000 Genomes Project (1KG) database and showed that these 10 MHs can clearly distinguish the four intercontinental populations of AFR, EAS, South Asia, and EUR, indicating their potential for ancestry informative inference. A system comprising 90 MH markers developed by Gandotra et al. (2020) demonstrated good discrimination between African, Asian, and European ancestries. Several other studies based on different continental populations have further verified the efficiency of MH markers in forensic ancestry informative inference (Pakstis et al., 2021; Kidd et al., 2021; Wen et al., 2021; Zhao et al., 2022). The NGS-based MH detection system developed by Wen et al. also included 10 MHs containing phenotypic SNPs (piSNPs) alongside 10 conventional highly polymorphic MH markers, allowing for the prediction of phenotypes (Wen et al., 2021). Based on the NGS platform, Jin et al. developed a complex detection system containing multiple types of markers (18 multi-allelic InDels, 27 MHs, and 36 Y-SNPs/InDels), showing good potential for individual identification, paternity testing, mixture deconvolution, ancestry informative inference, and Y haplogroup assignment (Jin et al., 2020) (Table 8.4).


Compound marker 


A compound marker is a new type of genetic marker formed by combining two different types of genetic markers, such as InDels, SNPs, and STRs, within a range of less than 300 bp. Common compound markers include InDel-SNP, InDel-InDel, SNP-STR, and InDel-STR. The NGS platform captures both the length and the sequence information of genetic markers and therefore provides an ideal approach for detecting compound markers. Based on the Illumina X-10 platform, Jin et al. developed a detection system for forensic individual identification containing 10 MHs and 13 InDel-SNPs. The CPD value of this system in the Chinese Shaanxi Han population reaches 0.999999999994835, which indicates its potential for individual identification (Jin et al., 2020). The 23-plex system showed high polymorphism in 26 other populations from five continents (AFR, America, EAS, South Asia, and EUR), suggesting that the system has high potential for individual identification in major intercontinental populations around the world. In addition, phylogenetic analysis of individuals from the five continents based on the 23-plex system loci showed that African populations formed a separate large clade, European and American populations clustered into a large clade, and South Asian and East Asian populations clustered into a large clade. The South Asian and EAS populations were further divided into two independent subclades, showing the 23-plex system’s good ancestry informative inference ability and its potential to distinguish the ancestral origins of the four intercontinental populations AFR, EAS, South Asia, and EUR (Jin et al., 2020).

Table 8.4 Studies of microhaplotypes on NGS platforms.

No. | Platforms | Number of markers | Details of markers | Populations | Sizes | Years | References
1 | HiSeq X10 | 14 | A-MHs | Chinese Han and 2 artificially prepared mixture samples | 60 | 2018 | Chen et al. (2018)
2 | MiSeq | 16 | A-MHs | 276 samples of 3 globally dispersed populations (5 Dutch, 3 Bhutanese, 2 Ghanaian, 2 pygmy and 3 Amerindian samples), 97 samples of 9 large families | 373 | 2018 | van der Gaag et al. (2018)
3 | HiSeq X10 | 25 | A-MHs | Chinese Han | 60 | 2019 | Chen et al. (2019)
4 | Database | 10 | A-MHs | 1KG | 2504 | 2019 | Chen et al. (2019)
5 | MiSeq | 13 | A-MHs | Chinese individuals | 37 | 2019 | Zhu et al. (2019)
6 | MiSeq and Ion S5 | 118 | 107 A-MHs + 11 X-MHs | Cell line | 5 | 2020 | de la Puente et al. (2020)
7 | MiSeq | 90 | A-MHs | 7 African populations (Biaka, Chagga, Sandawe, Zaramo), Asia (Taiwanese Chinese), individuals of mixed European ancestry (EuroAmericans), and an Eastern European population (Adygei) | 156 | 2020 | Gandotra et al. (2020)
8 | MiSeq PE300 | 20 | A-MHs | Chinese Han: 50 unrelated individuals and 12 parent/child duos | 74 | 2020 | Kureshi et al. (2020)
9 | MiSeq FGx | | 124 A-MHs + 20 STRs | Chinese Han | | 2020 | Pang et al. (2020)
10 | MiSeq PE300 | | A-MHs | 54 unrelated random Chinese Han + 38 parent-child duos and 55 uncle/aunt/grandparent-child duos | | 2020 | Sun et al. (2020)
11 | MGISEQ-2000RS | 129 | 48 AISNPs, 18 multi-allelic InDels, 27 microhaplotypes and 36 Y-SNPs/InDels | 120 Shaanxi Han, 100 Mongolian and 115 Hui individuals | 335 | 2021 | Jin et al. (2022)
12 | MiSeq | 151 | 83 AISNPs + 68 A-MHs | 95 populations of Kidd Lab | 3790 | 2021 | Kidd et al. (2021)
13 | MiSeq | 90 | 90 A-MHs | 16 populations of Kidd Lab | 524 | 2021 | Pakstis et al. (2021)
14 | MiSeq FGx | 30 | 30 A-MHs including 10 phenotypic MHs | Northern Han Chinese | 100 | 2021 | Wen et al. (2021)
15 | | 54 | 54 A-MHs | Two donor families from the northern Han Chinese population containing 67 individuals | 67 | 2021 | Wu et al. (2021)
16 | NextSeq 500 | 59 | 59 A-MHs | Han Chinese | 187 | 2021 | Wu et al. (2021)
17 | MiSeq PE300 | 26 | 26 A-MHs, nonbinary SNPs with molecular extent sizes no longer than 60 bases | Northern Han Chinese | 100 | 2021 | Zhao et al. (2022)
18 | SNaPshot and phase workflow/Ion S5 XL System | | 21 A-MHs | 10 Chinese populations (74 Chengdu Hans, 76 Dujiangyan Tibetans, 77 Muli Tibetans, 78 Xichang Yis, 78 Wuzhong Huis, 63 Zunyi Gelaos, 78 Hainan Lis, 80 Hainan Hans, 73 Ordos Mongolians and 87 Tibet Sherpas) | | | Zou et al. (2022)
19 | Ion S5 XL System/MinION nanopore sequencing | | 15 A-MHs | Chinese Han | | | Wang et al. (2022)
20 | HiSeq X-10 | | 40 A-MHs | Chinese Han population | | | Yang et al. (2022)

Based on the NextSeq 500 platform, Jin et al. (2022) developed a complex amplification system containing 29 genetic markers (22 MHs and 7 compound markers); the heterozygosity values of the 29 markers in 26 reference populations and two Chinese populations (Kirgiz and Mongolian) were greater than 0.4. The CMP values of these markers ranged from 3.1928E… in the East Asian populations to 1.7023E… in the European populations. The CMP values were smaller than those of the previous 30-InDel and 35-InDel systems, indicating the high polymorphism of the 29-plex system and its efficacy in forensic applications such as individual identification and paternity testing. In addition, the 29 markers showed large genetic differences among continental populations, making the panel an accurate and effective forensic system for distinguishing the four continental populations AFR, EAS, South Asia, and EUR (Jin et al., 2022) (Table 8.5).
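
The CPD and CMP figures quoted for these panels are products over loci. The following is a minimal sketch of that arithmetic, assuming independent loci, Hardy-Weinberg proportions, and allele frequencies supplied per locus; the function names and the toy two-locus panel are illustrative only and are not taken from the cited studies.

```python
from math import prod


def locus_match_probability(allele_freqs):
    """Random match probability at one locus under Hardy-Weinberg equilibrium:
    the sum of squared expected genotype frequencies."""
    alleles = list(allele_freqs)
    mp = 0.0
    for i, p in enumerate(alleles):
        mp += (p * p) ** 2              # homozygous genotype, frequency p_i^2
        for q in alleles[i + 1:]:
            mp += (2 * p * q) ** 2      # heterozygous genotype, frequency 2 p_i p_j
    return mp


def combined_statistics(panel):
    """Return (CMP, CPD) for a panel given as a list of per-locus allele frequencies.
    Assumes the loci are independent, so per-locus values multiply."""
    cmp_value = prod(locus_match_probability(freqs) for freqs in panel)
    return cmp_value, 1.0 - cmp_value


# Hypothetical toy panel: one biallelic and one triallelic marker.
toy_panel = [[0.5, 0.5], [0.2, 0.3, 0.5]]
cmp_value, cpd_value = combined_statistics(toy_panel)
print(f"CMP = {cmp_value:.6f}, CPD = {cpd_value:.6f}")
```

Because CMP shrinks multiplicatively with every added polymorphic locus, panels of a few dozen markers reach the very small CMP (and CPD values very close to 1) reported above.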


Statistical analytical methods of population genetic study 


The statistical analytical methods used in population genetic studies mainly include estimation of the genetic differentiation index (FST), the F test, multidimensional scaling (MDS) analysis, PCA, phylogenetic tree construction, and genetic admixture model prediction.


Genetic differentiation index (FST)


The genetic differentiation index (FST) is a widely used measure of the degree of genetic differentiation among populations. For populations that conform to Hardy-Weinberg equilibrium, the larger the FST value between two populations, the greater the difference between them, which makes the index suitable for comparing diversity among subgroups. FST evolved from F-statistics, of which there are three main types (FIS, FIT, FST). FST is designed for biallelic markers. If there are multiple alleles at the locus, differentiation needs to be measured by the gene differentiation coefficient (GST). Assuming that the whole population consists of several local populations, where the relative size of the k-th population is w_k, the frequency of the i-th allele in the k-th population is q_k(i), and the observed heterozygote frequency is h_k, then the observed mean heterozygote frequency in the whole population is HI, the mean expected heterozygote frequency of the local populations is HS, and the expected heterozygote frequency of the entire population is HT. They are:

\[
H_I = \sum_k w_k h_k, \qquad
H_S = \sum_k w_k \Bigl(1 - \sum_i q_k(i)^2\Bigr), \qquad
H_T = 1 - \sum_i \Bigl(\sum_k w_k q_k(i)\Bigr)^2
\]

\[
F_{IS} = \frac{H_S - H_I}{H_S}, \qquad
F_{ST} = \frac{H_T - H_S}{H_T}, \qquad
F_{IT} = \frac{H_T - H_I}{H_T}
\]

FIS is the proportional reduction of HI relative to HS, that is, the mean inbreeding coefficient within the local populations.

FST is the proportional reduction of HS relative to HT, that is, the average inbreeding coefficient between related local populations.

FIT is the proportional reduction of HI relative to HT, that is, the average inbreeding coefficient for the entire population.

Table 8.5 Studies of compound markers on NGS platforms.

No. | Machines | Number of markers | Details of markers | Populations | Sizes | Years | References
1 | NextSeq 500 | | 22 MHs + 7 compound markers | 112 Kazaks and 106 Mongolian | | 2020 | Jin et al. (2020)
2 | Illumina X-10 | | 10 MHs + 13 compound markers | Shaanxi Han population | | 2020 | Jin et al. (2020)

The FST value ranges from 0 to 1, with the maximum value of 1 indicating complete differentiation between two populations and the minimum value of 0 indicating no differentiation between them. In actual research, an FST value of 0 to 0.05 indicates that the genetic differentiation between two populations is very small and can be ignored; a value of 0.05 to 0.15 indicates moderate genetic differentiation; a value of 0.15 to 0.25 indicates large genetic differentiation; and a value above 0.25 indicates very large genetic differentiation. FST is generally calculated with software such as VCFtools, Genepop, and Arlequin.
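
As a worked illustration of the definitions above, the sketch below computes HI, HS, HT and the three F-statistics for a single locus. It assumes the population weights, allele frequencies, and observed heterozygosities are already known; the function name and the two-population example are hypothetical, and packages such as VCFtools or Arlequin use their own estimators with sample-size corrections that are not reproduced here.

```python
def f_statistics(weights, allele_freqs, observed_het):
    """Single-locus F-statistics from heterozygosities.

    weights      -- relative sizes w_k of the local populations
    allele_freqs -- allele_freqs[k][i] is q_k(i), frequency of allele i in population k
    observed_het -- observed heterozygote frequency h_k in each population
    """
    total = sum(weights)
    w = [x / total for x in weights]          # normalise so the weights sum to 1
    n_alleles = len(allele_freqs[0])

    h_i = sum(wk * hk for wk, hk in zip(w, observed_het))
    h_s = sum(wk * (1 - sum(q * q for q in qk)) for wk, qk in zip(w, allele_freqs))
    pooled = [sum(wk * qk[i] for wk, qk in zip(w, allele_freqs)) for i in range(n_alleles)]
    h_t = 1 - sum(q * q for q in pooled)

    # Note: a monomorphic locus (H_S or H_T equal to 0) would divide by zero here.
    return {
        "HI": h_i, "HS": h_s, "HT": h_t,
        "FIS": (h_s - h_i) / h_s,
        "FST": (h_t - h_s) / h_t,
        "FIT": (h_t - h_i) / h_t,
    }


# Two equally sized populations with opposite allele frequencies, each in HWE.
result = f_statistics(
    weights=[1, 1],
    allele_freqs=[[0.9, 0.1], [0.1, 0.9]],
    observed_het=[0.18, 0.18],
)
print({name: round(value, 3) for name, value in result.items()})
```

In this toy case each population is internally in equilibrium (FIS close to 0) while the two populations differ strongly in allele frequency, so FST is about 0.64, which falls in the "very large differentiation" band described above.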


F test 


The F test mainly includes the three-population test and the four-population test. The three-population test (f3) is a powerful method for analyzing population genetic data and provides reliable evidence for population admixture events, even those that occurred within the last few hundred generations. First, the outgroup test f3(X, Y; Outgroup) is usually used to analyze the genetic drift shared between population X and population Y. A significantly positive value indicates that, after populations X and Y separated from the other populations, gene exchange occurred between the two populations (X and Y). Then, the test f3(X, Y; Z) is used to analyze admixture events in population Z; a significantly negative value indicates that population Z is admixed from populations related to X and Y. The four-population test (f4) can not only analyze population admixture but also provide discernible information on the direction of gene flow. Two types of f4 tests are usually used in population studies: the symmetry f4 test and the affinity f4 test. The symmetry f4 test involves test population 1, test population 2, a reference population, and an outgroup; Yoruba or Mbuti is used as the outgroup, depending on the needs of the analysis. If the symmetry f4 test yields a nonsignificant value (absolute Z value < 3), the count of derived alleles (BABA) shared between the reference population and test population 1 is equal to the count of derived alleles (ABBA) shared between the reference population and test population 2. Consequently, test population 1 and test population 2 form a clade relative to the reference population. A significantly positive value (Z > 3) indicates that test population 1 contains a relatively high genetic component related to the reference population. A significantly negative value (Z < -3) indicates that test population 2 contains a relatively high genetic component related to the reference population. The affinity f4 test involves reference population 1, reference population 2, a test population, and an outgroup. A significantly positive value indicates that the test population shares more alleles with reference population 1, whereas a significantly negative value indicates that more alleles are shared between the test population and reference population 2.
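
The sketch below shows how f3 and f4 can be computed from per-SNP allele frequencies and how a Z value can be attached with a block jackknife. It is a simplified illustration under stated assumptions: the frequency-based definitions used here are the textbook forms, the finite-sample corrections applied by dedicated software are omitted, blocks are formed from consecutive SNPs rather than genomic distance, and the function names and block count are hypothetical.

```python
import numpy as np


def f3_per_site(x, y, target):
    """Per-SNP terms of f3(X, Y; target) = E[(t - x)(t - y)]."""
    x, y, t = (np.asarray(v, dtype=float) for v in (x, y, target))
    return (t - x) * (t - y)


def f4_per_site(a, b, c, d):
    """Per-SNP terms of f4(A, B; C, D) = E[(a - b)(c - d)]."""
    a, b, c, d = (np.asarray(v, dtype=float) for v in (a, b, c, d))
    return (a - b) * (c - d)


def block_jackknife(per_site, n_blocks=10):
    """Return (estimate, Z) using a delete-one-block jackknife.

    Assumes roughly equal block sizes; SNP order stands in for genomic position.
    """
    per_site = np.asarray(per_site, dtype=float)
    blocks = np.array_split(per_site, n_blocks)
    total_sum, total_n = per_site.sum(), per_site.size
    estimate = total_sum / total_n
    # Estimate recomputed with each block left out in turn.
    leave_one_out = np.array(
        [(total_sum - b.sum()) / (total_n - b.size) for b in blocks]
    )
    variance = (n_blocks - 1) / n_blocks * np.sum(
        (leave_one_out - leave_one_out.mean()) ** 2
    )
    return estimate, estimate / np.sqrt(variance)


# Toy frequencies: the target lies between X and Y at every SNP, so f3 is negative.
f3_value = f3_per_site([0.1, 0.2], [0.3, 0.4], [0.2, 0.3]).mean()
f4_value = f4_per_site([0.5, 0.5], [0.3, 0.1], [0.4, 0.2], [0.2, 0.4]).mean()
print(f3_value, f4_value)
```

With real data, the per-site arrays would span many thousands of SNPs, and the |Z| > 3 threshold mentioned above would be applied to the second value returned by `block_jackknife`.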


Multidimensional scaling 


MDS analysis uses the similarity between pairs of samples to construct a low-dimensional space in which the “distances” between samples are as consistent as possible with the similarities between the samples in the original high-dimensional space. MDS represents the spatial distribution of clusters in a population, and this spatial distribution can, to a certain extent, represent the pattern of genetic differentiation between populations. The relationships between genetic polymorphisms and between populations can be analyzed by directly comparing the two-dimensional MDS plot. The quality of an MDS result is measured by the stress coefficient (stress). The MDS plot usually reports the stress value of the model, which is used to assess whether the plot accurately reflects the true distribution of the data. It is generally believed that when stress < 0.2, the data can be represented by a two-dimensional MDS plot that has some explanatory value; when stress < 0.1, the ordination can be considered good; and when stress < 0.05, the representation is very good. The MDS plot is a scatter plot: the dots represent samples, and different colors or shapes indicate the groups to which the samples belong. The distances between sample points within the same group reflect how consistent the samples of that group are, and the distances between groups reflect the differences between groups. Currently, MDS analysis is commonly performed using software such as SPSS and R.
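
To make the relationship between the distance matrix, the two-dimensional configuration, and the stress value concrete, here is a small sketch of classical (metric) MDS. It is one of several MDS variants and is not necessarily the algorithm used by SPSS or R; the stress shown is a stress-1 style ratio, and the function names are illustrative.

```python
import numpy as np


def classical_mds(dist, n_components=2):
    """Classical MDS: double-centre the squared distances and take the top eigenvectors."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (d ** 2) @ centering
    eigval, eigvec = np.linalg.eigh(b)
    order = np.argsort(eigval)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigval[order], 0, None))   # guard against tiny negatives
    return eigvec[:, order] * scale


def stress(dist, coords):
    """Stress-1 style mismatch between input distances and distances in the plot."""
    d = np.asarray(dist, dtype=float)
    fitted = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    upper = np.triu_indices_from(d, k=1)
    return float(np.sqrt(np.sum((d[upper] - fitted[upper]) ** 2) / np.sum(fitted[upper] ** 2)))


# Four samples whose true positions are the corners of a unit square.
points = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
distances = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
layout = classical_mds(distances)
print(layout.shape, stress(distances, layout))
```

Because the toy distances really are two-dimensional, the stress is essentially zero; genetic distance matrices rarely embed this cleanly, which is why the thresholds of 0.2, 0.1, and 0.05 are used to judge the plot.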


Principal component analysis 


PCA is a widely used analysis method in forensic genetics. In 1995, Cavalli-Sforza et al. used PCA for the first time to analyze complex population data on a global scale with multiple loci (Jolliffe, 2002). The study used dimensionality reduction to preserve as much of the between-group variation in the data as possible and to display this variation intuitively on a two-dimensional plane. As a commonly used method of genetic statistics, PCA calculates a new set of variables called principal components (PCs) from the original data. Each PC accounts for part of the variance, and the PCs are ordered as PC1, PC2, ..., PCn (Cann, 1995). In forensic genetics, PCA offers an intuitive visualization for evaluating the degree of differentiation between populations and effectively reveals the genetic differences between populations through their relative positions. The research on European population structure stratification by Novembre and Lao showed that PCA can effectively reveal the differentiation between populations (Pritchard et al., 2000). Some researchers still argue that PCA directly superimposes data variables on geographic space, leading to limited matching results that may not accurately reveal the relative degree of differentiation between populations. In general, however, PCA still provides an intuitive method for revealing the differentiation between populations in forensic genetics. If there are large genetic differentiations between the study populations and the selected molecular genetic markers are sufficiently informative, the discrete clusters formed by the samples in two-dimensional space are enough to reflect the degree of genetic differentiation between populations. Common software for constructing PCA plots includes STRAF, MVSP, Origin, R, and PLINK.
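
The following sketch shows the core computation behind a PCA plot for genotype data: centre the sample-by-SNP matrix, take its singular value decomposition, and read sample coordinates and explained variance from the result. It assumes genotypes coded as 0/1/2 copies of one allele with no missing values; the function name and toy matrix are hypothetical, and tools such as PLINK add normalisation and missing-data handling not shown here.

```python
import numpy as np


def genotype_pca(genotypes, n_components=2):
    """PCA of a samples x SNPs genotype matrix via SVD of the column-centred data.

    Returns the sample scores on the leading PCs and the fraction of variance
    explained by each of them.
    """
    g = np.asarray(genotypes, dtype=float)
    centered = g - g.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]


# Two samples from each of two hypothetical, well-differentiated populations.
toy_genotypes = [
    [0, 0, 0, 1],
    [0, 0, 1, 0],
    [2, 2, 2, 1],
    [2, 2, 1, 2],
]
scores, explained = genotype_pca(toy_genotypes)
print(scores[:, 0], explained)
```

In the toy example PC1 separates the two groups by sign and accounts for most of the variance, which is the pattern a PCA plot is expected to show when the chosen markers differentiate the populations well.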


Phylogenetic tree 


Phylogenetic tree construction is a genetic statistical method that uses a tree-like branching diagram to represent the genetic relationships between organisms or populations. In forensic genetics, a phylogenetic tree is mainly constructed from DNA sequences and DNA genotypes. The purpose of the algorithm is to reconstruct the ancestral sequence traits and to measure, using the branches of the dendrogram, the degree of divergence between individuals or populations that descend from the same ancestor. A “leaf” node represents the degree of divergence, the branches of the dendrogram represent different subgroups, and the lengths of the branches represent the genetic distances between the populations. According to the presence or absence of a “root”, phylogenetic trees are divided into “rooted trees” and “unrooted trees”. “Rooted trees” can also describe the temporal order of populations or genes, while “unrooted trees” mainly reflect the distances between populations (Fordyce et al., 2011; Seo et al., 2013). At present, the methods for constructing phylogenetic trees fall into the following categories: distance methods, maximum likelihood (ML), maximum parsimony (MP), and Bayesian inference (BI).

The distance method calculates a matrix of pairwise distances between sequences, repeatedly merges the two sequences with the shortest distance, and finally constructs the optimal tree. The algorithm is simple and easy to understand, and it is computationally fast. The unweighted pair-group method with arithmetic means (UPGMA) first places each sequence in its own cluster, finds the two clusters with the smallest distance, and merges them into one cluster; this process is repeated until all clusters have been merged, and finally a tree root is generated. However, this method is less commonly used nowadays. The neighbor-joining (NJ) method is commonly used at present. It not only calculates the pairwise distances but also minimizes the length of the entire tree, thereby constraining the topology of the tree and overcoming the defect that the UPGMA algorithm requires a constant evolutionary rate. The NJ method is an algorithm based on the principle of minimum evolution (sequence homology). “Neighbor-joining” means that two taxa on a phylogenetic tree are connected by only one internal node. This method is relatively accurate, computationally fast, and makes minimal assumptions. However, because it treats all sites in the sequence equally and the evolutionary distances of the analyzed sequences are limited, the NJ method is suitable for short sequences with small evolutionary distances and few informative sites.

The ML method is the algorithm most consistent with evolutionary facts, but it is computationally intensive and time-consuming, which is not conducive to large-scale population structure analysis. In addition, because its inferred tree is not unique, it is not suitable for sequences with large variations; the algorithm is therefore only suitable for samples with long sequences, small sequence differences, and similar mutation rates. The MP method is based on the hypothesis that the evolutionary process involves the smallest number of base substitutions. It was originally used for the study of phenotypes. The principle of the MP method is to determine the shortest evolutionary tree by requiring the least amount of change in traits among the most similar individuals. The BI method is based on statistical inference under an evolutionary model and can handle complex models that are close to the actual situation. The algorithm integrates the existing phylogenetic data and intuitively reflects the reliability of each branch through its posterior probability. The advantages of the BI method are that it does not need to be tested by the bootstrap method and that it is more sensitive to the evolutionary model, which makes it suitable for large and complex datasets. The most commonly used software packages for phylogenetic trees are TreeMix, MEGA, and PHYLIP.
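
The UPGMA procedure described above (merge the two closest clusters, update distances by size-weighted averaging, repeat until one root remains) can be sketched in a few lines. This is an illustrative implementation for a small distance matrix, not the code used by MEGA or PHYLIP; the function name and the three-taxon example are hypothetical.

```python
def upgma(labels, dist):
    """Build a UPGMA tree from a symmetric distance matrix.

    Returns the tree as nested tuples and the list of merges as
    (left_subtree, right_subtree, node_height).
    """
    n = len(labels)
    clusters = {i: (labels[i], 1) for i in range(n)}          # id -> (subtree, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        (a, b), d_ab = min(d.items(), key=lambda item: item[1])
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        # Distance from the new cluster to each remaining one: size-weighted average.
        updated = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            updated[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        d.update(updated)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        merges.append((tree_a, tree_b, d_ab / 2))            # ultrametric node height
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree, merges


# Toy example: A and B are close to each other and equally far from C.
tree, merges = upgma(["A", "B", "C"], [[0, 2, 6], [2, 0, 6], [6, 6, 0]])
print(tree)
print(merges)
```

Placing each internal node at half the merge distance is what encodes the constant-rate assumption of UPGMA mentioned above; NJ drops that assumption by correcting each pairwise distance for the average divergence of the two taxa from all others.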


Structure analysis 


Structure analysis is a statistical method that uses genetic markers to infer the genetic structure of a population. The basic idea of STRUCTURE is to first obtain the genotypes of a series of samples. In most cases, researchers do not know how many subgroups the population actually contains. The number of subgroups of the population is the K value, and the preset number of subgroups is 1 to n, that is, K = 1~n. The software then uses a Bayesian algorithm to calculate how the populations are grouped and what the ancestral composition of each individual is when simulating K = 1~n.

At present, three software packages can perform population structure analysis, namely STRUCTURE, ADMIXTURE, and fastSTRUCTURE. STRUCTURE is based on a Bayesian algorithm and is applied to genetic clustering statistics. STRUCTURE analysis is usually used to evaluate samples with mixed ancestral backgrounds and can provide a clearer composition ratio for individuals with mixed ancestry. The software works on the original genotyping results of STRs or SNPs and clusters the populations under different population clustering coefficients (K) (Reich et al., 2008; Fondevila et al., 2013). STRUCTURE can generate a cluster grouping coefficient matrix in one run and compare an unknown sample with the reference populations. This comparison does not consider the regional classification of the populations but clusters the samples entirely on the basis of genetic similarity. Therefore, the algorithm is often used to evaluate the discrimination efficiency of a specific combination of molecular genetic markers in the relevant populations. If the clusters match the actual geographic source division, the set of molecular genetic markers can be considered to have good discriminating power for the relevant populations. In addition, STRUCTURE analysis is stochastic, which means that multiple runs will produce different results (different values), but this does not affect the algorithm's evaluation of the genetic similarity between populations (Rosenberg, 2004; Jakobsson & Rosenberg, 2007). Using the DISTRUCT software (Kalinowski, 2011), the results of STRUCTURE can be visualized and presented as an intuitive picture. In addition, the STRUCTURE results can be optimized with the CLUMPP software (Boeckmann et al., 2015), the stability of the population clustering coefficient K can be systematically measured with the online software STRUCTURE HARVESTER, and the optimal K value for the population data can be screened out to prevent prior assumptions about the population from affecting the results (Meng et al., 2011). Furthermore, in order to trace population evolution information more comprehensively, software such as PHASE is usually used to perform haplotype phasing on whole-genome data to obtain the length or number of haplotype blocks required for subsequent analysis. A comprehensive reconstruction analysis of population genetic affinity is then performed using software such as fineSTRUCTURE.

Another widely used software package for genetic admixture model construction is ADMIXTOOLS, which is well suited to the high-throughput data generated by the NGS platform. ADMIXTOOLS is a software package developed in the laboratory of David Reich (Patterson et al., 2012). It is designed for computing and testing population admixture statistics and validating admixture events. Among its programs, qpWave and qpAdm can determine the minimum number of ancestral sources contributing to the studied population and calculate the proportions of the ancestral components. Populations that are distantly related to the studied population and have no recent admixture among them are selected as outgroups, and qpWave and qpAdm are used to construct an admixture model of the studied population. qpWave focuses on finding the smallest number of ancestral populations needed to explain the genetic variation contained in the target population, whereas qpAdm mainly estimates the admixture proportions of a specific population under a given set of ancestral populations. qpGraph is an admixture graph modeling method that calculates allele frequency correlation patterns based on f2, f3, and f4 statistics simultaneously. The latest software version has been updated to a second-generation automatic or semiautomatic intelligent analysis system. Its genetic evolution model is superior to that of the traditionally used TreeMix.

With the development of the NGS platform and the reduction of the per-sample sequencing cost, the output of whole-genome data has increased exponentially. MSMC, MSMC2, momi, momi2, and other software can analyze population split times, changes in effective population size, and complex population structure using genome-wide site frequency spectra or phased haplotype data. These software packages are mainly developed for the Linux system.


Learning practices of population study based on NGS platform for
developing a 40 ancestry informative SNP (AI-SNP) multiplex panel
for forensic ancestral inference on 17 different continental
populations


Taking the construction of an ancestry prediction system as an example, we briefly introduce some methods and the basic workflow of a population genetic study. The general workflow of a population study is as follows: (1) selecting a set of SNP loci associated with different geographical origins across the whole human genome; (2) analyzing the genetic structures of 17 continental populations from AFR, EUR, and EAS based on the raw genotype data of the selected SNPs provided by the 1000 Genomes Project database; (3) evaluating the efficacy of the selected loci in predicting continental biogeographical origins, as well as investigating the genetic relationships and backgrounds of the 17 populations from the three continents.


Assessing the contributions of the 40 AI-SNPs for ancestral information 
inference 

The criteria for the selection of SNP loci are as follows: (1) the allele frequency differences (delta, δ) between any pairwise continental populations from AFR, EUR, and EAS are greater than 0.4; (2) all of the selected loci must be biallelic SNP loci located in intron regions, so as to reduce the risk of genotyping bias caused by diseases, etc.; (3) the physical distance between any two loci on the same chromosome must be greater than 10 Mb, and there must be no linkage between the selected loci; (4) none of the SNPs may deviate from Hardy-Weinberg equilibrium in the East Asian populations; (5) no linkage disequilibrium is found between any two SNP loci.
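
The selection criteria can be expressed as a simple filter. The sketch below is illustrative only: the `Snp` record, its field names, and the 0.05 significance level are assumptions, "any pairwise" is read here as "every pair of continents", the Hardy-Weinberg check is a one-degree-of-freedom chi-square test on East Asian genotype counts, and the linkage disequilibrium check of criterion (5) is not implemented.

```python
from dataclasses import dataclass
from itertools import combinations
from math import erfc, sqrt


@dataclass
class Snp:
    rsid: str
    chrom: str
    pos: int                 # base-pair position
    region: str              # functional annotation, e.g. "intron"
    n_alleles: int
    freq: dict               # allele frequency per continental group
    eas_genotypes: tuple     # (n_AA, n_Aa, n_aa) counts in East Asians


def hwe_p_value(n_aa, n_ab, n_bb):
    """Chi-square (1 df) test of Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    expected = (n * p * p, 2 * n * p * (1 - p), n * (1 - p) ** 2)
    chi2 = sum((o - e) ** 2 / e for o, e in zip((n_aa, n_ab, n_bb), expected) if e > 0)
    return erfc(sqrt(chi2 / 2))      # survival function of chi-square with 1 df


def select_ai_snps(snps, min_delta=0.4, min_gap=10_000_000, alpha=0.05):
    kept = []
    for snp in sorted(snps, key=lambda s: (s.chrom, s.pos)):
        deltas = [abs(snp.freq[a] - snp.freq[b])
                  for a, b in combinations(("AFR", "EUR", "EAS"), 2)]
        if min(deltas) <= min_delta:                          # criterion (1)
            continue
        if snp.n_alleles != 2 or snp.region != "intron":      # criterion (2)
            continue
        if hwe_p_value(*snp.eas_genotypes) < alpha:           # criterion (4)
            continue
        if any(k.chrom == snp.chrom and abs(k.pos - snp.pos) <= min_gap
               for k in kept):                                # criterion (3)
            continue
        kept.append(snp)
    return kept


good = {"AFR": 0.05, "EUR": 0.50, "EAS": 0.95}
candidates = [
    Snp("s1", "1", 1_000_000, "intron", 2, good, (25, 50, 25)),
    Snp("s2", "1", 5_000_000, "intron", 2, good, (25, 50, 25)),    # too close to s1
    Snp("s3", "1", 20_000_000, "intron", 2, good, (25, 50, 25)),
    Snp("s4", "2", 1_000_000, "intron", 2,
        {"AFR": 0.05, "EUR": 0.30, "EAS": 0.95}, (25, 50, 25)),    # delta too small
    Snp("s5", "2", 30_000_000, "intron", 2, good, (50, 0, 50)),    # fails HWE
]
selected = [snp.rsid for snp in select_ai_snps(candidates)]
print(selected)
```

Replacing `min(deltas)` with `max(deltas)` gives the looser reading in which a single strongly differentiated pair of continents is enough.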

Based on the above criteria, 40 AI-SNP loci for ancestry inference of populations from AFR, EUR, and EAS were finally confirmed. Using the 1000 Genomes Project data of the 17 reference populations from the three continents, the selected 40 AI-SNPs were evaluated for their efficacy in ancestry inference. The cos² values of the 40 AI-SNPs obtained from the PCA based on the genotype frequency data of the studied populations are shown in Fig. 8.1. Cos² values are commonly used to compare the contribution of each SNP to population differentiation: the larger the cos² value, the farther the corresponding SNP lies from the origin and the closer it is to the periphery, indicating that the SNP performs better in explaining the population differentiation.

Figure 8.1 Cos² values of 40 AI-SNPs in 17 different continental populations.

As shown in Fig. 8.1, the cos² values range from 0.94 to 0.99, and all of the selected 40 AI-SNPs are distributed roughly around the periphery of the circle, indicating that the contribution levels of these 40 AI-SNPs are relatively high and balanced. This indicates that the SNPs combine well and further supports their applicability to ancestry-informative inference. The balanced distribution of contributions ensures that the prediction is not dominated by a single SNP, making the overall inference performance more robust and accurate.
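
For readers unfamiliar with cos² values, the sketch below shows one common way to obtain them: each variable (here, a SNP) is standardised, its coordinate on a principal component is its correlation with that component, and cos² is the squared coordinate summed over the components shown in the plot. This is an assumption about how such values are typically computed, not a description of the exact procedure behind Fig. 8.1; the function name and the toy frequency table are hypothetical.

```python
import numpy as np


def variable_cos2(data, n_components=2):
    """cos2 of each variable (column) on the first n_components of a correlation PCA.

    data is a populations x SNPs table of frequencies; columns must not be constant.
    """
    x = np.asarray(data, dtype=float)
    z = (x - x.mean(axis=0)) / x.std(axis=0)           # standardise each SNP
    corr = (z.T @ z) / z.shape[0]                      # correlation matrix of the SNPs
    eigval, eigvec = np.linalg.eigh(corr)
    order = np.argsort(eigval)[::-1][:n_components]
    # Variable coordinates = loadings scaled by sqrt(eigenvalue)
    coords = eigvec[:, order] * np.sqrt(np.clip(eigval[order], 0, None))
    return np.sum(coords ** 2, axis=1)


# Hypothetical frequencies of three SNPs in four populations.
toy_freqs = [
    [0.1, 0.9, 0.3],
    [0.5, 0.5, 0.1],
    [0.9, 0.1, 0.4],
    [0.3, 0.7, 0.2],
]
print(variable_cos2(toy_freqs, n_components=2))
```

A cos² close to 1 on the first two components means the SNP is almost fully represented in the two-dimensional plot, which corresponds to a point near the edge of the correlation circle.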


Gene frequency statistics of 40 AI-SNP loci in the 17 populations from three 
continents 

The allele frequencies of the selected 40 AI-SNPs in the 17 continental populations from AFR, EUR, and EAS were statistically analyzed, and the corresponding frequency heat map is shown in Fig. 8.2. The color of each cell represents the allele frequency of the corresponding SNP in the given population; according to the yellow-blue gradient scale on the right, the yellower the cell, the higher the frequency, and vice versa. The color blocks on the left indicate the continental origins of the populations, with African, European, and East Asian populations labeled in green, blue, and pink, respectively. As shown in the figure, the frequency distributions of the selected 40 AI-SNPs differ significantly among the three continental groups. The frequency-based clustering on the left edge shows that the 17 studied populations cluster mainly into three sub-branches, with populations from the same continent clustering together and separating from the populations of other continents. Therefore, the selected 40 AI-SNPs have the potential to distinguish African, East Asian, and European populations.

Figure 8.2 Gene frequency heatmap of 40 AI-SNPs in 17 different continental populations. The population names and the corresponding abbreviations are as follows: ACB, African Caribbean in Barbados; ASW, African Ancestry in Southwest US; CDX, Xishuangbanna Dai Chinese; CEU, Utah residents with Northern and Western European ancestry (France); CHB, Beijing Han Chinese; CHS, Southern Han Chinese; ESN, Esan in Nigeria; FIN, Finnish in Finland; GBR, British in England and Scotland; GWD, Gambian in Western Division-Mandinka; IBS, Iberian populations in Spain; JPT, Tokyo Japanese; KHV, Ho Chi Minh Kinh; LWK, Luhya in Webuye, Kenya; MSL, Mende in Sierra Leone; TSI, Toscani in Italy; YRI, Yoruba in Ibadan, Nigeria.


Investigation of the genetic relationships among the 17 populations from 
three continents using 40 AI-SNPs 

Population differentiation measures of the 17 continental populations 

The pairwise FST is a basic measure of population differentiation, and the pairwise 
FST heatmap with two symmetrical clustering trees of the 17 studied populations is 
shown in Fig. 8.3A. The two symmetrical clustering trees of FST are both divided into the 
same three sub-branches, with the East Asian cluster located at the top, the European 
cluster in the middle, and the African cluster at the bottom. 
Populations from the same continent cluster with each other and are separated from the 
populations on other continents. The pairwise FST values among the 17 continental 
populations range from 0 to 0.911, with an average of 0.508. The smallest FST value is found 
between the ESN and YRI populations from AFR, whereas the largest FST value is observed 
between the ESN population from AFR and the CHS population from EAS. In African populations, 
the FST values range from 0 to 0.048, with an average of 0.017. In European populations, the 
FST values vary from 0.001 to 0.018, with a mean of 0.008. As for East Asian populations, the 
FST values range from 0.002 to 0.030, with an average of 0.012. In summary, the FST 
analysis reveals less genetic differentiation and closer genetic relationships among 
populations from the same continent, while greater differentiation and more distant 
genetic relationships are observed among populations from different continents. 

Figure 8.3 Heatmaps of the pairwise FST (A) and DA (B) values in 17 different continental populations 
calculated with the 40 AI-SNPs genotypes. 
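
The relationship between allele frequency differences and FST can be illustrated with a short calculation. The sketch below is only an illustration: the chapter does not state which FST estimator was used, so the simple heterozygosity-based definition FST = (HT - HS) / HT for two equally weighted populations is assumed here, and the function names and allele frequencies are hypothetical.

```python
def fst_biallelic(p1: float, p2: float) -> float:
    """Heterozygosity-based FST for one biallelic SNP in two populations.

    p1 and p2 are the frequencies of the same allele in each population.
    Both populations are weighted equally (an assumption of this sketch).
    """
    # H_S: expected heterozygosity within populations, averaged over the two.
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    # H_T: expected heterozygosity of the pooled population.
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        return 0.0  # the SNP is monomorphic in both populations
    return (h_t - h_s) / h_t


def fst_multilocus(freqs_pop1, freqs_pop2) -> float:
    """Combine several SNPs as a ratio of summed (H_T - H_S) to summed H_T."""
    numerator = 0.0
    denominator = 0.0
    for p1, p2 in zip(freqs_pop1, freqs_pop2):
        h_s = p1 * (1 - p1) + p2 * (1 - p2)
        p_bar = (p1 + p2) / 2
        h_t = 2 * p_bar * (1 - p_bar)
        numerator += h_t - h_s
        denominator += h_t
    return numerator / denominator if denominator else 0.0


# Hypothetical frequencies: similar populations versus strongly diverged ones.
print(fst_biallelic(0.52, 0.48))  # close to 0
print(fst_biallelic(0.90, 0.10))  # large differentiation
```

Under these assumptions, populations with nearly identical frequencies give values near 0, while markers that are close to fixation for different alleles give values approaching 1, which is the pattern expected of ancestry-informative SNPs.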

Another widely used measure of population genetic distance, DA, is based on the infinite 
gene mutation model, and the pairwise DA value matrix of the 17 populations is also presented as 
a heatmap (Fig. 8.3B). According to the pairwise DA values, the populations from the 
three continents cluster into three sub-branches, and the populations with the same 
geographical origin cluster together and separate from the populations on other 
continents. The pairwise DA values of the 17 populations range from 0.0015 to 0.2594, 
with the highest value found between the ESN from AFR and the CHB from EAS, 
and the lowest value observed between the two European populations FIN and CEU. 
Similar to the FST analysis results, the lower DA values among 
populations from the same continent indicate closer genetic distances and possibly 
more gene exchange, while the higher DA values among populations from 
different continents indicate greater genetic distances between continents than 
within a continent. 


PCA analyses of the 17 continental populations based on 40 AI-SNPs 

PCA of the 17 studied populations from three continents is performed at the 
population level and the individual level (Fig. 8.4). In the population-level PCA, 
the first two PCs account for 95.1% of the total variation (61.76% PC1 
and 36.48% PC2). Points with red, blue, and green colors represent the populations from 
AFR, EUR, and EAS, respectively, and it can be seen from Fig. 8.4A that the populations 
from the same geographical region cluster together. AFR and EAS populations group in 
the left and right quadrants, respectively, while the EUR populations cluster in the middle. 
In Fig. 8.4B, 1668 individuals are classified according to their continental origins, which are 
labeled with different colors: red, blue, and green represent individuals 
from AFR, EUR, and EAS, respectively. Individuals from the same continent cluster 
together and separate from individuals with different continental origins. 

Figure 8.4 Population (A) and individual (B) levels of PCA analyses in the 17 continental populations 
using 40 AI-SNPs. 


Phylogenetic analyses of the 17 continental populations based on 40 AI-SNPs 

Phylogenetic analysis is an important method in genetic studies that uses a branching 
diagram to illustrate the genetic relationships between different species or populations. In this section, 
MEGA software and TreeMix software are used to reconstruct the phylogenetic trees 
of the studied populations on the basis of the NJ and ML algorithms, respectively, and to further analyze 
the phylogenetic relationships among the 17 populations. 

In the NJ tree based on pairwise FST values of the 17 populations, the populations from 
the same continent are annotated with the same background color, and these populations 
are mainly distributed in three major branches of the tree, namely the African, European, and 
East Asian branches (Fig. 8.5A). In each major branch, the subgroups from the same 
continent cluster together. As for the unrooted ML tree based on ML estimation 
(Fig. 8.5B), populations from the same continent are also labeled with the same color 
according to the caption on the right. It can be inferred from the figure that the 7 
populations from AFR cluster together at the top of the tree; the 5 populations clustered in the 
middle are from EUR; and the East Asian populations cluster at the bottom of the tree. 
Overall, both of the phylogenetic trees constructed with the NJ and ML algorithms can 
effectively distinguish the populations from the three continents, but there are some 
inconsistencies in the finer-scale estimates of relationships among 
populations from the same continent. 

Figure 8.5 The NJ tree (A) and inferred ML phylogenetic tree (B) constructed for the 17 
continental populations. 


In the NJ tree, ESN and YRI show a close genetic distance among the 7 populations 
in AFR and then cluster with LWK, MSL, GWD, ACB, and ASW successively. In the 
ML tree, the African populations are divided into three branches, wherein the ESN, 
YRI, MSL, and LWK display closer genetic distances and then cluster together with 
the GWD, ACB, and ASW, respectively. For populations from EUR and EAS, the ge- 
netic relationships among different populations estimated by the NJ and ML methods are 
similar. Within the European populations, the closest genetic distance is observed be- 
tween TSI and CEU, and the two populations then cluster with GBR, FIN, and IBS 
into the European branch of the NJ tree. As for the ML tree, TSI still has the closest ge- 
netic relationship with CEU, forming a small subbranch, while GBR and FIN display a 
close genetic distance, forming another small subbranch. Then, these two sub-branches 
cluster together with IBS into the European branch. Among the five populations from 
EAS, JPT and CHB have the closest genetic distance in the NJ tree and subsequently 
cluster together with CHS, CDX, and KHV, forming the East Asian branch. As for 
the ML tree, Chinese Han populations of CHS and CHB share the most recent ancestry 
and later cluster together with JPT, CDX, and KHV. 


Structure analysis of the 17 continental populations with 40 AI-SNPs 

Based on the 40 AI-SNPs, a STRUCTURE analysis is performed in the 17 continental 
populations (Fig. 8.6). According to the highest estimated values of Delta K and Prob (K), the 
optimal K is determined to be 3. When K = 2, populations with the 
dominant East Asian ancestral component (green) are separated from the others with different 
geographical origins, while the African (orange) and European (red) ancestral 
components are distinguished when K = 3. However, no new groups with novel ancestral 
component patterns are identified as K increases further (4 to 7). As 
shown in the cluster analysis (Fig. 8.6B), individuals from East Asian (red), European (blue), 
and African (yellow) populations are distributed at the three corners of the ternary diagram. 
Individuals from the same continent congregate with each other and separate from 
individuals with other continental origins. The STRUCTURE and cluster results further 
demonstrate that the selected 40 AI-SNPs perform well in distinguishing the 
African, European, and East Asian populations. 

Figure 8.6 STRUCTURE analyses (K = 2 to 7) on individual level based on 40 AI-SNPs genotype data of 
the 1668 unrelated individuals from 17 continental populations (A); Cluster analysis for the results of 
STRUCTURE analysis on individual level when K = 3 (B). 


Ancestral inference using the selected 40 AI-SNPs 

Based on the genotyping data of the 17 populations from the three continents, the supervised 
learning algorithm Random Forest is applied to evaluate the efficacy of the 
40 AI-SNP panel as a classifier of continental populations, with 75% of the 
individuals from each continent randomly selected as the training set and the 
remaining 25% used as the testing set. As a result, the selected 40 AI-SNPs are almost 100% accurate 
(95% CI: 0.9912-1) when distinguishing the studied populations from the three 
continents. The prediction accuracy of the random forest model based on 40 AI-SNPs for 
populations from the three continents is shown in Table 8.6. Except for one African 
individual who is incorrectly assigned as European, the geographical origins of the 
remaining 415 individuals from the testing set are correctly identified. Evaluation metrics 
of the random forest predictive model of ancestral origin inference based on the 40 
AI-SNPs are displayed in Table 8.7. It can be concluded from the table that the constructed 
model exhibits very high sensitivity and specificity when predicting the testing 
set, and the false negative and false positive rates are correspondingly low. 
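
The evaluation metrics named in Table 8.7 can all be derived from a confusion matrix such as the one in Table 8.6. The following sketch shows one way to compute them for a single class in a one-versus-rest fashion. It is an illustration only: the function name is hypothetical, and the small confusion matrix is invented toy data, not the counts from this study.

```python
def class_metrics(confusion, labels, target):
    """One-versus-rest metrics for `target`.

    confusion[i][j] is the number of individuals whose true (reference)
    class is labels[j] and whose predicted class is labels[i].
    """
    k = labels.index(target)
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fp = sum(confusion[k]) - tp                 # predicted target, truly another class
    fn = sum(row[k] for row in confusion) - tp  # truly target, predicted as another class
    tn = total - tp - fp - fn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "positive_predictive_value": tp / (tp + fp),
        "negative_predictive_value": tn / (tn + fn),
        "prevalence": (tp + fn) / total,
        "detection_rate": tp / total,
        "detection_prevalence": (tp + fp) / total,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Toy data (hypothetical): one African individual is predicted as European.
labels = ["Africa", "East Asia", "Europe"]
confusion = [
    [9, 0, 0],    # predicted Africa
    [0, 10, 0],   # predicted East Asia
    [1, 0, 10],   # predicted Europe
]
for name in labels:
    print(name, class_metrics(confusion, labels, name))
```

In this toy example a single misassignment lowers the sensitivity of the African class and the specificity and positive predictive value of the European class, while the East Asian class remains at 1 for every rate. The sketch assumes every class is both present and predicted at least once; otherwise a division by zero would need to be handled.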


Table 8.6 Predictions of ancestral origins for populations from the three continents based on the 
random forest model using 40 AI-SNPs. 


Reference prediction Africa East Asia Europe 
Africa : 

East Asia 0 
Europe 125 


Table 8.7 Evaluation metrics of the random forest predictive model of ancestral origins based on the 
40 AI-SNPs. 


Class African East Asian European 
Sensitivity 1 
Specificity 1 
Positive prediction value 1 
Negative prediction value 1 
Prevalence 0.3005 
Detection rate 0.3005 
Detection prevalence 0.3005 
Balanced accuracy 1 


Summary 


1. Forensic population genetic study aims to investigate the variations of gene fre- 
quencies and genotype frequencies in different populations by mathematical and sta- 
tistical methods; to study the genetic differentiations within and between populations; 
to analyze the genetic structures of populations; to survey their genetic backgrounds; 
and eventually to lay the foundation for forensic applications. 

2. Before a genetic marker or multiplex panel is applied in forensic practice, it is 
necessary to systematically evaluate its genetic polymorphism and population 
applicability, with the aim of developing a reasonable analytical method for forensic application 
based on the results of the population study. 

3. As the currently mainstream molecular markers for population study on the NGS 
platform, STRs, SNPs, mtDNA, MHs, and compound markers (i.e., SNP-STR, 
InDel-SNP, InDel-STR) can play an important role in the applications of forensic 
individual identification, kinship analysis, ancestral information inference, and so on. 

4. Estimation of FST, F test, MDS analysis, PCA, phylogenetic tree reconstruction, and 
genetic admixture predictive models are the main analytical approaches for forensic 
population genetic research. 

5. Based on the NGS platform, a combination of 40 AI-SNPs is studied, which can be 
used for forensic ancestral information inference of the studied 17 populations from 
the three continents of EUR, AFR, and EAS. 


Introduction 


Short tandem repeats (STRs) are regions of DNA in which a short sequence motif is repeated 
in tandem, and the number of repeats varies between individuals. The repeated motifs typically 
contain between two and six base pairs (Wyner et al., 2020). Autosomal STRs are found on the 
22 pairs of autosomal chromosomes, which do not determine a person's sex (What is STR 
Analysis? National Institute of Justice, 2011). Because the number of repeats 
is variable, STRs are often used in forensics to distinguish individuals (Wyner et 
al., 2020). Several commercially produced kits amplify autosomal STRs as well 
as STRs found on the X and Y chromosomes for analysis by either next-generation 
sequencing (NGS) or capillary electrophoresis (CE). NGS provides a more detailed 
analysis of STRs because it determines both the length and the sequence of 
each locus, whereas CE determines only the length. One way of analyzing 
STRs is to perform population studies that show how sequence and length vary 
by population. Using NGS methods, the sequences of both the flanking regions and the 
repeat region of STR loci can be determined, and certain sequences can be more 
prevalent in certain populations (Kwon et al., 2021). 
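
The difference between length-based and sequence-based STR typing described above can be made concrete with a small sketch. The two repeat-region sequences below are invented for illustration (they are not taken from any cited study), and the function names are hypothetical. Both sequences have the same length, so a size-based method would assign them the same allele, yet their sequences differ.

```python
def longest_tandem_run(sequence: str, motif: str) -> int:
    """Return the largest number of back-to-back copies of `motif` in `sequence`."""
    best = 0
    for start in range(len(sequence)):
        run = 0
        pos = start
        # Count consecutive copies of the motif beginning at this position.
        while sequence.startswith(motif, pos):
            run += 1
            pos += len(motif)
        best = max(best, run)
    return best


def length_based_allele(repeat_region: str, motif_length: int) -> int:
    """Allele designation as seen by a size-based method: number of repeat units."""
    return len(repeat_region) // motif_length


# Hypothetical repeat regions of equal length (40 bases each).
allele_a = "TCTA" * 10
allele_b = "TCTA" * 5 + "TCTG" + "TCTA" * 4

print(length_based_allele(allele_a, 4), length_based_allele(allele_b, 4))
print(longest_tandem_run(allele_a, "TCTA"), longest_tandem_run(allele_b, "TCTA"))
print(allele_a == allele_b)
```

Both sequences would be designated allele 10 by length, but sequencing reveals that the second carries an interrupting TCTG unit, so the two are distinct sequence variants (isoalleles) of the same length.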


Next-generation sequencing 


NGS, otherwise known as massively parallel sequencing (MPS), is a method of deter- 
mining the DNA sequence(s) found in a sample. NGS is a high-throughput technology 
that allows for the sequences of multiple DNA templates to be read at the same time, in 
parallel. Whole genome sequencing can be performed using NGS technologies to 
amplify and sequence the entire genome. Targeted sequencing can also be completed 
to determine the sequence of specific regions on the DNA template. Multiple methods 
of NGS exist, including sequencing-by-synthesis, pyrosequencing, and ion detection 
(Bruijns et al., 2018). 

Sequencing-by-synthesis using reversible terminators, used by Illumina, begins 
with adapters being ligated to the template DNA; those adapters bind to 
complementary oligonucleotides found on the flow cell. Adapters on both ends of the DNA 
strand bind to the complementary sequences, forming a bridge. Bridge amplification 
then occurs, producing clusters of identical DNA sequences, with many 
clusters of different DNA sequences forming across the flow cell. Once the strands are denatured, 
fluorescently labeled reversible-terminator nucleotides are added to build the complementary 
strand one base at a time. Each of the four nucleotide types has its 
own fluorescent label, so as each base is added, its unique fluorescence is detected 
(Bruijns et al., 2018). This allows the sequence of the DNA strand to be determined. 

Pyrosequencing, used in the Roche 454 technology, is another sequencing-by-synthesis 
method. As a complementary base is incorporated into the growing DNA 
strand, pyrophosphate is released and converted to adenosine triphosphate (ATP). This 
ATP drives the luciferase-catalyzed conversion of luciferin to oxyluciferin, which 
emits light. So, when the nucleotide that has been added is complementary to the template 
and is incorporated into the strand being synthesized, light is detected (Bruijns et al., 2018). 
Because only one type of dNTP is added at a time, the sequence of the template 
DNA strand can be determined. 

Ion detection, introduced by Ion Torrent, detects changes in pH rather than the light or 
fluorescence detected by the previous two methods. When a nucleotide is incorporated 
into the growing DNA strand, pyrophosphate and a proton are released. The released proton 
changes the pH, which is detected as a voltage change. If one specific nucleotide is added to 
the DNA being sequenced and the voltage changes, then that nucleotide has been incorporated 
(Bruijns et al., 2018). As different nucleotides are added to the reaction in turn, these 
changes can be monitored to determine 
the sequence of the DNA. These NGS technologies, along with others, have been 
applied in a variety of ways, one of which is the analysis of autosomal STRs. 
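
Pyrosequencing and ion detection share the same logic: one type of nucleotide is supplied at a time, and a signal indicates whether it was incorporated. A minimal sketch of that logic follows. It is an idealization under stated assumptions: the flow order "TACG" and the function names are hypothetical, the template is written in the direction in which the polymerase traverses it, and the signal is assumed to be exactly proportional to the number of bases incorporated in one flow, which real instruments only approximate.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def simulate_flows(template: str, flow_order: str = "TACG"):
    """Return a list of (nucleotide, signal) pairs, one per nucleotide flow.

    The signal is the number of bases incorporated during that flow
    (0 when the supplied nucleotide is not complementary to the next base).
    """
    # Strand that will be synthesized, base by base, against the template.
    new_strand = "".join(COMPLEMENT[base] for base in template)
    flows = []
    pos = 0
    while pos < len(new_strand):
        for nucleotide in flow_order:
            incorporated = 0
            # A run of identical bases is incorporated within a single flow.
            while pos < len(new_strand) and new_strand[pos] == nucleotide:
                incorporated += 1
                pos += 1
            flows.append((nucleotide, incorporated))
            if pos == len(new_strand):
                break
    return flows


def decode_flows(flows) -> str:
    """Reconstruct the synthesized strand from the recorded signals."""
    return "".join(nucleotide * signal for nucleotide, signal in flows)


flows = simulate_flows("AACGT")
print(flows)
print(decode_flows(flows))
```

Flows with a signal of zero correspond to nucleotides that were offered but not incorporated, and a signal of two corresponds to two identical bases added in a single flow.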


Application of NGS in autosomal STR analysis 


NGS is a newer method for the analysis of STRs and SNPs that can be utilized in the field 
of forensic science. Using STR profiles obtained from NGS methods, individuals can be 
identified, whether they are victims or suspects. Samples received by technicians and 
analysts for forensic analysis are often small and contain low-quality and/or low-quantity 
DNA. NGS has been shown to be suitable for the analysis of STRs in DNA that is 
degraded or found in low quantities (Bruijns et al., 2018). The ability of this technology 
to target short fragments is an advantage over the traditionally used CE 
method. 

NGS has been used to help identify victims of World War II after the discovery of 
skeletal remains in the Babna Gora municipality in Slovenia. Genotyping with the Precision ID 
GlobalFiler NGS STR Panel, which includes 31 autosomal STRs, was 
successful for only one of three incomplete skeletons. To attempt to identify that 
individual, samples were also taken from the presumed niece and nephew of the individual. 
After analysis, the posterior probability was calculated to be 99.99986%, meaning that the 
victim was far more likely to be related to the presumed relatives than to be 
unrelated to them (Pajnic et al., 2019). So, NGS can be useful for the 
identification of victims by comparing victim samples to family samples and completing 
the analysis of autosomal STRs. 
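
A posterior probability of relatedness such as the one quoted above is conventionally obtained by combining a likelihood ratio with a prior probability through Bayes' theorem. The sketch below shows that conversion in both directions. The prior of 0.5 is an assumption made here for illustration; the prior and likelihood ratio actually used in the cited study are not given in this chapter, and the function names are hypothetical.

```python
def posterior_probability(likelihood_ratio: float, prior: float = 0.5) -> float:
    """Posterior probability of relatedness from a likelihood ratio and a prior."""
    prior_odds = prior / (1 - prior)
    posterior_odds = likelihood_ratio * prior_odds
    return posterior_odds / (1 + posterior_odds)


def likelihood_ratio_from_posterior(posterior: float, prior: float = 0.5) -> float:
    """Likelihood ratio implied by a posterior probability under a given prior."""
    posterior_odds = posterior / (1 - posterior)
    prior_odds = prior / (1 - prior)
    return posterior_odds / prior_odds


# With an assumed prior of 0.5, a posterior of 99.99986% implies a
# likelihood ratio on the order of several hundred thousand.
implied_lr = likelihood_ratio_from_posterior(0.9999986)
print(round(implied_lr))
print(posterior_probability(implied_lr))
```

A different prior would imply a different likelihood ratio for the same posterior, which is why the prior should always be reported alongside a posterior probability.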

A challenge for identifications using STR profiles obtained by CE arises in cases 
involving monozygotic (identical) twins. After CE, the STR profiles of identical twins 
are the same, so the two individuals cannot be differentiated. This can be problematic 
when working with forensic evidence from crime scenes or determining whether an individual 
is the father of a child in paternity testing. To differentiate between identical 
twins, the DNA must be sequenced and mutations located. Weber-Lehmann et al. 
used Illumina's technology to search for single nucleotide polymorphisms in two identical 
twins and the child of one of the twins. They were able to locate sites with single 
base pair polymorphisms, including sites on chromosomes 4, 6, and 14, that 
differed between the two twins and were shared by the parent and child 
(Weber-Lehmann et al., 2014). 

In addition to differentiating between identical twins, NGS has also been 
shown to differentiate the genetic profiles of unrelated individuals in samples with mixed 
DNA contributed by multiple people. Mixed samples are often collected in forensic cases, 
containing mixtures of suspects' and victims' DNA. Momota et al. (2021) 
investigated whether NGS would be a beneficial tool when analyzing samples 
with more than one DNA contributor. They created two-person mixtures at different 
ratios. After performing NGS, they found that differences in the STR sequences could 
be determined up to a mixing ratio of 1:5 (Momota et al., 2021). This indicates that 
autosomal STR profiles can be separated out for two-person mixtures, which can be 
useful in forensic contexts when evidence has multiple DNA contributors. 

As the utility of NGS in forensics becomes more apparent, developmental validations 
are being completed for the instruments and software programs that analyze STRs as well 
as SNPs using NGS. Developmental validations are necessary for instruments, software, 
and reagents needed for forensic analysis as they determine the reliability of the test. 
Many elements are included within a developmental validation, including studies on 
reproducibility, accuracy, and sensitivity, among others (National Commission on 
Forensic Science: Views of the Commission: Validation of Forensic Science Methodol- 
ogy National Institute of Standards and Technology, 2/29/16). One such validation was 
completed for the MiSeq FGx Forensic Genomics System. The ForenSeq DNA 
Signature Prep Kit, included in Table 9.1, is part of this system. Sensitivity studies test dilutions of 
DNA to determine the percentage of alleles called with varying amounts of DNA. 
Reproducibility studies determine whether the method produces the same results when 
multiple analysts complete it. Accuracy studies establish whether the 
results of the analysis are correct. These tests were all performed with the MiSeq FGx 
Forensic Genomics System. In the sensitivity study, DNA inputs from 
1 ng down to 62.5 pg resulted in no allele loss. The reproducibility test, completed by 
five analysts, was used to determine accuracy in addition to precision. The results showed 
that for primer set A, there was 100% accuracy and 99.89% precision. For primer set B, 
the results were 100% for both accuracy and precision (Jager et al., 2017). These findings 
show that the MiSeq FGx Forensic Genomics System, utilizing NGS, can be reliable for 
forensic analysis of autosomal STRs. 

Traditional methods of STR analysis use CE to develop the 
STR profiles. As NGS is introduced into more laboratories, it has been 
demonstrated that NGS can increase the information gained from profiles when compared to 
CE. These two methods were compared in a study by Revoir et al. (2019). 
Fifty-three DNA samples were amplified using the ForenSeq DNA Signature Prep Kit for 
NGS and the Promega PowerPlex Fusion 6C kit for CE. These two kits have 26 targeted 
loci in common. Initially, there was 99.3% concordance between the 
NGS kit and the CE kit. When the results in the Universal Analysis Software for 
the MiSeq were reviewed, the concordance rose to 99.8%, because alleles 
that had been called below threshold were recovered. The remaining disagreements between the two kits 
were attributed to missing allelic data at different loci, possibly 
due to inefficient amplification or dropout, particularly at the Penta E 
locus. In addition to determining concordance between the methods, the authors also looked at 
how allele counts changed. The change in allele count arises because NGS 
distinguishes alleles by sequence, whereas CE determines alleles by size. Seventeen out of 
twenty-three loci showed an increase in the number of alleles when alleles were determined by 
sequence. The largest increase in allele number was exhibited by the D12S391 locus, 
which gained an additional 20 alleles (Revoir et al., 2019). This study demonstrates 
that NGS is a beneficial method for the analysis of STRs. The concordance 
of allele calls between NGS and CE was greater than 99%; in addition, NGS 
provides sequence information that can distinguish alleles by more 
than just size. 
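
Concordance between two typing methods, as reported above, is the proportion of compared loci at which both methods give the same genotype. The sketch below shows one simple way to compute it. The genotypes are invented toy data, the function name is hypothetical, and length-based allele designations are used for both methods so that the calls are directly comparable.

```python
def concordance(calls_a: dict, calls_b: dict):
    """Compare genotype calls at the loci shared by two methods.

    Each dictionary maps a locus name to a tuple of allele designations.
    Returns the fraction of concordant loci and the list of discordant loci.
    """
    shared = sorted(set(calls_a) & set(calls_b))
    if not shared:
        raise ValueError("the two call sets have no loci in common")
    # Allele order within a genotype is irrelevant, so compare sorted alleles.
    discordant = [
        locus for locus in shared
        if sorted(calls_a[locus]) != sorted(calls_b[locus])
    ]
    return (len(shared) - len(discordant)) / len(shared), discordant


# Hypothetical calls for one sample; the second set shows dropout at Penta E.
ce_calls = {"D12S391": ("18", "21"), "PentaE": ("7", "12"), "TH01": ("6", "9.3")}
ngs_calls = {"D12S391": ("21", "18"), "PentaE": ("7",), "TH01": ("9.3", "6")}

rate, discordant_loci = concordance(ce_calls, ngs_calls)
print(rate, discordant_loci)
```

In a full study the same comparison would be accumulated over all samples and loci, and each discordant locus would be reviewed to find its cause, such as a subthreshold allele or dropout.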


Comparing loci amplified using different kits for NGS applications, 
including two kits for CE applications 


Table 9.1 lists the autosomal STR loci that are targeted by the ForenSeq DNA 
Signature Prep Kit, Precision ID GlobalFiler NGS STR Panel v2, PowerSeq 46GY 
System, Investigator 24Plex QS Kit, and the PowerPlex Fusion System. The first three kits 
are used for NGS applications, while the last two kits are used for analysis using CE. Based 
on the number of autosomal STR loci targeted by each kit, the kits for NGS applications 
detect more loci than those for use with CE. Loci such as D4S2408 and 
D6S1043 are included in more than one kit for NGS but not in the kits for CE. Only one 
of the available-for-purchase NGS-based multiplex kits discussed here, the MainstAY SE, 
includes the SE33 locus. The SE33 locus is also included when amplifying DNA using 
Qiagen's Investigator 24Plex kit. 


Table 9.1 Loci amplified for 4 NGS kits and 2 CE kits. 

Kits compared (source document in parentheses): 
Verogen ForenSeq MainstAY SE kit (Forensic Mainstay Kit Reference Guide) (NGS) 
Verogen ForenSeq DNA Signature Prep kit (ForenSeq DNA Signature Prep Reference) (NGS) 
Precision ID GlobalFiler NGS STR panel v2 (Precision ID GlobalFiler NGS STR, 1612) (NGS) 
Promega PowerSeq 46GY system (PowerSeq 46GY system Technical Manual) (NGS) 
Qiagen Investigator 24Plex QS (Investigator 24plex QS Handbook) (CE) 
Promega PowerPlex Fusion system (PowerPlex Fusion system for use) (CE) 

Loci: 
D1S1656 
TPOX 
D2S441 
D2S1338 
D3S1358 
D4S2408 
FGA 
D5S818 
CSF1PO 
D6S1043 
D7S820 
D8S1179 
D9S1122 
D10S1248 
TH01 
vWA 
D12S391 
D13S317 
PentaE 
D16S539 
D17S1301 
D18S51 
D19S433 
D20S482 
D21S11 
PentaD 
D22S1045 
D1S1677 
D2S1776 
D3S4529 
D5S2800 
D6S474 
D12ATA63 
D14S1434 
SE33 


When comparing the NGS STR kits with each other, the Promega PowerSeq 46GY 
System amplifies the fewest loci, totaling 22. This kit includes all twenty of the 
core Combined DNA Index System (CODIS) loci, as well as those included in the 
European Standard Set (ESS) (Switching to 20 Core CODIS, 1210; Core STR Loci Used 
in). Verogen's ForenSeq DNA Signature Prep Kit amplifies a total of 27 loci. The loci 
this kit targets also include the expanded set of CODIS loci and the ESS loci. Verogen's 
MainstAY SE kit, which is in the beta testing stage as of June 2022, includes all of the loci 
amplified in the ForenSeq DNA Signature Prep Kit as well as SE33. The Precision ID 
GlobalFiler NGS STR Panel v2 amplifies the most loci of the NGS kits compared, 
amplifying 31 different autosomal STRs. CODIS and ESS loci are targeted 
with this kit as well. 


Population studies of autosomal STRs using NGS 


Different population groups in the world can have differences in allele frequencies for 
particular STRs as well as variance in the DNA sequence of the STRs that have been 
amplified. Kwon et al. (2021) used an in-house panel for NGS and analyzed samples ob- 
tained from four groups: African American, Caucasian, Hispanic, and Korean. A total of 
25 autosomal STRs were analyzed by this group. One STR that was highlighted was 
SE33, a locus that is highly polymorphic in terms of both length and sequence. It 
was analyzed because of a lack of data on its sequence polymorphisms when compared to 
other STR markers. After the MPS panel was created, MPS was completed using a 
PCR-based method for creating the library. SE33 is characterized 
by [CTTT] repeats; when it was analyzed, however, differences in the repeat structure were noted. Isoalleles were 
also identified and counted for both repeat and flanking regions. Allele frequencies 
were calculated based on the sequences in the 5' and 3' flanking regions of SE33. For 
the African American and Hispanic samples that were tested, the highest percentage of 
alleles observed had [TTTC] preceding the repeat region, while for Caucasians and 
Koreans, [CTTC] was the most frequently observed sequence at that location. The most SE33 
sequence variations were exhibited in the African American population, with nine 
different sequence variations, while the fewest were found in the 
Hispanic population, which exhibited only three variations. Seven 
polymorphic positions in total were found in SE33 (Kwon et al., 2021). 

The SE33 locus was also analyzed using the Verogen ForenSeq MainstAY 
SE kit, which was being beta-tested at Towson University. The researchers there tested 
samples from Caucasian and African American populations. All three 
Caucasian samples tested had isoalleles at the SE33 locus, although in one sample 
the isoallele was detected below threshold. In the African American 
population, isoalleles were observed in only 4 of the 
19 samples tested (Noelle Neff, unpublished results). 

Population studies and the data they generate allow for statistical calculations to be 
performed in forensic cases. These statistics determine the likelihood that an individual 
is a contributor to DNA found on a piece of evidence. One group, Gettings et al. 
(2018), sought to increase the amount of available allele frequency data for different 
population groups in order to support likelihood statistics. They targeted 27 different 
autosomal loci, and the frequencies are reported with high confidence. African American, 
Caucasian, Asian, and Hispanic population groups were represented in the samples. 
Some of the STR loci targeted include D12S391, D18S51, and D4S2408. The analyses 
were performed using two different software tools: Universal Analysis Software by 
Illumina and a pipeline derived from STRait Razor 2.0. After the analyses were completed, 
the group found that 16 of the 27 loci had sequence polymorphisms in their flanking regions, 
totaling 40 polymorphisms. The majority of these variations were single-nucleotide 
polymorphisms, but insertions and deletions were also found. Sequence motifs were 
determined, and the frequency of each motif within a particular population was calculated. 
These frequencies varied by population and motif. For example, the data showed that 
at the D12S391 locus, the [AGAT]n [AGAC]n AGAT motif occurred most frequently in 
Asian populations, with a 90.72% allele frequency in that population. In the Caucasian 
population, this number was lower, with a 66.07% frequency. At the D13S317 locus, 
the [TATC]n motif was most common in the African American population at 
69.74%. The same motif had a frequency of 41.75% in the Asian population (Gettings 
et al., 2018). Reporting of these population frequencies helps with statistical calculations 
used in forensics to determine the likelihood of a person contributing DNA to a sample. 
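
How population data feed into such statistics can be sketched briefly. The example below estimates allele frequencies by counting observed sequence-based alleles and then computes an expected genotype frequency under Hardy-Weinberg equilibrium. The observations are invented toy data and the function names are hypothetical; the sketch also ignores the subpopulation corrections and minimum-frequency rules that are applied in real casework.

```python
from collections import Counter


def allele_frequencies(observed_alleles):
    """Estimate allele frequencies by counting the alleles observed in a sample."""
    counts = Counter(observed_alleles)
    total = sum(counts.values())
    return {allele: count / total for allele, count in counts.items()}


def genotype_frequency(frequencies, allele_1, allele_2):
    """Expected genotype frequency under Hardy-Weinberg equilibrium."""
    p = frequencies[allele_1]
    q = frequencies[allele_2]
    # Homozygote: p squared; heterozygote: 2pq.
    return p * p if allele_1 == allele_2 else 2 * p * q


# Hypothetical sequence-based alleles observed at one locus in one population.
observed = ["[TATC]11", "[TATC]11", "[TATC]11", "[TATC]12"]
freqs = allele_frequencies(observed)

print(freqs)
print(genotype_frequency(freqs, "[TATC]11", "[TATC]12"))
print(genotype_frequency(freqs, "[TATC]11", "[TATC]11"))
```

Because the frequencies are estimated separately for each population, the same genotype can have a different expected frequency depending on which population database is used.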

Dai et al. (2019) conducted a population study for the Chinese Han people. STRs 
from DNA in blood samples were analyzed using NGS. When 
compared with the number of alleles found for a particular locus using CE methods, the number 
of alleles found using NGS increased. This change was due to single nucleotide polymorphisms 
being recognized in the repeat or flanking region of the STR and to differences in the 
structure of the repeat. Differences in the structure of the repeat were noted for the seven 
compound loci that were amplified. At several loci, NGS identified at least twice as many 
alleles as CE. These loci include D2S1338, 
D13S317, D21S11, and vWA, among others. Using next-generation sequencing, this 
group was able to determine the number of alleles for each amplified locus in the Chinese 
Han community (Dai et al., 2019). A similar result was found for other population groups 
in another research study. 

Illumina’s sequencing technology was used by Gettings et al. (2016) to study variation in
the DNA sequences of autosomal STRs. Twenty-two loci were analyzed in this study. DNA
samples represented people in the African American, Caucasian, and Hispanic populations.
The group found that for 15 of the loci, the number of alleles defined by sequence was
greater than the number of alleles defined by length. This shows that NGS has a greater
ability to discriminate between different alleles than CE methods because it generates
information at the sequence level of the DNA. They were also able to calculate the
probability of identity (PI) for each population group. The PI is the probability that,
at a particular locus, two people within the same population have the same genotype. In
general, the PI decreased when calculated based on sequence instead of length, although
the value varied for each population. For example, at the D12S391 locus, the PI was 0.020
for African Americans, 0.019 for Europeans, and 0.028 for Hispanics when calculated by
sequence (Gettings et al., 2016).
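
The effect of sequence-level alleles on the PI can be illustrated with a short
calculation. The sketch below assumes Hardy-Weinberg proportions, so that the PI at a
locus is the sum of the squared expected genotype frequencies; the function name and all
allele frequencies are hypothetical and are not taken from Gettings et al. (2016).

```python
from itertools import combinations


def probability_of_identity(allele_freqs):
    """Sum of squared genotype frequencies, assuming Hardy-Weinberg proportions."""
    # Homozygous genotype i/i has frequency p_i^2, so its square is p_i^4.
    homozygotes = sum(p ** 4 for p in allele_freqs)
    # Heterozygous genotype i/j has frequency 2 * p_i * p_j.
    heterozygotes = sum((2 * p * q) ** 2 for p, q in combinations(allele_freqs, 2))
    return homozygotes + heterozygotes


# Hypothetical locus: three alleles when typed by length (as with CE).
length_based = [0.5, 0.3, 0.2]
# Same locus typed by sequence: the 0.5 length allele splits into two
# sequence variants with frequency 0.25 each (hypothetical values).
sequence_based = [0.25, 0.25, 0.3, 0.2]

pi_length = probability_of_identity(length_based)
pi_sequence = probability_of_identity(sequence_based)
print(f"PI by length:   {pi_length:.4f}")
print(f"PI by sequence: {pi_sequence:.4f}")
```

In this toy example, splitting one length allele into two sequence alleles lowers the PI,
which is the direction of change reported when the PI is calculated by sequence instead
of length.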

The D21S11 locus was analyzed by Rockenbauer et al. (2014) using samples obtained from 29
paternity cases in Denmark. Not all of the individuals were related; the samples from
five of the cases came from unrelated individuals. Sequencing was performed using Roche
Diagnostics’ GS Junior System, which utilizes the pyrosequencing method. CE results
indicated that, at D21S11, a mutation might have occurred between a parent and the child
in 54 confirmed father/mother/child-related samples. NGS was able to suggest which of the
alleles coming from the parent had mutated. The mutations observed in the locus alleles
were insertions or deletions of either single or multiple repeat units. After sequencing
the unrelated individuals’ DNA, 20 alleles for the D21S11 locus were recognized, while
only 13 were identified after CE. Variations in the number of subrepeats contributed to
differences in sequence for the same allele call (Rockenbauer et al., 2014). So, while
alleles in a case sample may have the same length, NGS data may show that their actual
sequences differ.


Conclusion 


NGS is a newer technology that is able to analyze STRs, including autosomal STRs. NGS can
determine the DNA sequence as well as the length of the STR loci. Because the sequence
can be determined, NGS has an advantage over the STR analysis capabilities of CE, which
can distinguish STRs only by size. Many population studies have been performed to
calculate population allele frequencies. These values are used in forensics to weigh the
likelihood that the DNA found on evidence came from a particular person against the
likelihood that it came from a different person. Analysis of autosomal STRs using NGS has
also been shown to be useful for separating DNA mixtures as well as detecting differences
between monozygotic twins. NGS has many applications for the analysis of autosomal STRs;
however, its usefulness extends beyond autosomal STRs.


Introduction to DNA analysis methods


The conventional method of individualizing suspects in most forensic investigations
involves short tandem repeat (STR) typing with capillary electrophoresis (CE). Although
STR typing is an effective method for identifying individuals, it is not capable of
differentiating monozygotic twins. Monozygotic twins come from the same fertilized egg.
As a result, they have nearly identical genomes and identical STR allele length profiles.
Therefore, current research has started to focus on methods that can discriminate between
monozygotic twins. Multiple methods have been attempted, including sequencing the
mitochondrial genome, sequencing the nuclear genome, microRNA (miRNA) profiling, and
comparing DNA methylation, or the methylome.

The application of next-generation sequencing (NGS) has expanded the possibilities for
differentiating monozygotic twins. The high throughput of NGS allows a large portion of
DNA to be sequenced in a shorter amount of time. For mitochondrial genome sequencing, the
entire chromosome can be sequenced as opposed to just the hypervariable regions.
Sequencing the entire mitochondrial genome allows more differences within the genome to
be located, which is useful for differentiating monozygotic twins. For both the
mitochondrial and nuclear genomes, the flanking regions can be important for identifying
sequence variations. Sequence variations have been found in flanking regions, which makes
them an important area of study for monozygotic twins because such variations can be few
and difficult to locate (Fig. 10.1).


Next Generation Sequencing (NGS) Technology in DNA Analysis © 2024 Elsevier Inc.
ISBN 978-0-323-99144-5, https://doi.org/10.1016/B978-0-323-99144-5.00010-X All rights reserved.


Figure 10.1 Methods for differentiating monozygotic twins.


Mitochondrial genome sequencing


Recent research has focused on detecting differences in monozygotic twins by using NGS to
sequence the mitochondrial genome. One reason for the interest in the mitochondrial
genome is that it has fewer DNA repair mechanisms than the nuclear genome, which leads to
a higher mutation rate and substitution rate (Wang et al., 2015). These characteristics
make it more likely that there will be differences between monozygotic twins. In
addition, it has a higher copy number and a smaller genome size, which make it easier to
sequence (Yuan et al., 2020). The mitochondrial genomes of six pairs of
monozygotic twins were sequenced from blood samples and compared to determine whether any
heteroplasmy or variation between the twins could be detected (Wang et al., 2015). A
point heteroplasmy occurs when more than one nucleotide is observed at a single position
(Wang et al., 2015). The blood samples from the six pairs were sequenced on the Illumina
HiSeq 2000 sequencing system. After sequencing, 11 point heteroplasmies were found (Wang
et al., 2015). Three of the heteroplasmic events were in the HV1 or HV2 region, and eight
were in coding regions (Wang et al., 2015). The 11 heteroplasmies had varying levels of
heterogeneity in all of the monozygotic twin pairs except for one. In addition, a single
nucleotide variant (nt15301) was detected in four out of the six pairs of monozygotic
twins (Wang et al., 2015). After comparing the mitochondrial genomes within the twin
pairs, all of the twin pairs were differentiated by either one of the 11 heteroplasmies
or the nucleotide variant (Wang et al., 2015).

Sequencing the mitochondrial genome has also been applied to a study of three rape cases
and a murderous adultery case that were linked because the STR profiles were identical
and both monozygotic twins’ STR profiles were consistent with the DNA found at the crime
scenes (Yuan et al., 2020). Since two individuals had STR profiles consistent with the
evidence, neither twin could be charged until a difference was found. To discriminate
between the twins, whole-genome sequencing (WGS) was performed (Yuan et al., 2020). Blood
samples from both twins were obtained, and the mitochondrial genomes were sequenced.
After sequencing, five single nucleotide polymorphisms (SNPs) were found (Yuan et al.,
2020). Amplification refractory mutation system polymerase chain reaction (ARMS-PCR) with
four primers was used to confirm the SNPs. After the ARMS-PCR, one somatic mutation in
the mitochondrial genome, m.6903T>C, was confirmed (Yuan et al., 2020). Two different
bases were present at this position: 2.6% cytosine and 97.4% thymine (Yuan et al., 2020).
This somatic mutation was found in the nail evidence sample of one twin but not the
other. This somatic mutation was used to distinguish the monozygotic twins. The DNA
evidence found in both the first and second cases contained the same somatic mutation, so
one twin could be identified. In addition, the other twin had two heteroplasmic loci at
m.6935C>T and m.6938C>T in the semen samples collected, which were not found in the crime
scene evidence samples. Therefore, that twin could be excluded as a contributor to the
DNA evidence (Yuan et al., 2020). These three differences in the mitochondrial genome
were used to differentiate the monozygotic twins.
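
A point heteroplasmy such as the 2.6% cytosine and 97.4% thymine mixture described above
can be expressed as base fractions computed from read counts at one position. The sketch
below is illustrative only: the read counts, the function names, and the 2% calling
threshold are assumptions chosen for the example, not values or methods reported by Wang
et al. (2015) or Yuan et al. (2020).

```python
def base_fractions(counts):
    """Convert per-base read counts at one position into fractions."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads cover this position")
    return {base: n / total for base, n in counts.items() if n > 0}


def is_heteroplasmic(counts, min_fraction=0.02):
    """Flag a position when a second base reaches min_fraction of the reads.

    min_fraction is a hypothetical threshold; a validated workflow would set
    it from the error profile of the sequencing platform.
    """
    fractions = sorted(base_fractions(counts).values(), reverse=True)
    return len(fractions) > 1 and fractions[1] >= min_fraction


# Hypothetical read counts at one mitochondrial position for each twin.
twin_a = {"A": 0, "C": 26, "G": 0, "T": 974}   # 2.6% C, 97.4% T
twin_b = {"A": 0, "C": 0, "G": 0, "T": 1000}   # T only

print(base_fractions(twin_a))
print(is_heteroplasmic(twin_a), is_heteroplasmic(twin_b))
```

A position that is flagged in one twin but not the other is a candidate for
discriminating the pair, subject to confirmation by an independent method such as the
ARMS-PCR step described above.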


Nuclear genome sequencing 


Along with mutations and heteroplasmies in the mitochondrial genome, SNPs have also been
found in the nuclear genome. These are particularly useful in a paternity test involving
potential fathers who are monozygotic twins because the SNPs in the nuclear genome are
inherited by offspring. Although these polymorphisms are useful, they are extremely rare
because the mutation must occur after the splitting of the zygote but as early in
development as possible so that it is present in multiple cell types (Weber-Lehmann et
al., 2014). If the mutation occurs after the splitting of the zygote and before the
separation of the germ layers, it will be present in multiple cell types. If the mutation
occurs later in development, it may only be present in sperm cells or other specific cell
types. In one study, a paternity test was performed on twin brothers and a child by
probing SNPs that arose due to mutations (Weber-Lehmann et al., 2014). Sperm, blood, and
mucosa samples were collected from the twin brothers, and blood samples were collected
from the child and the mother. Shotgun libraries were prepared and sequenced with an
Illumina HiSeq 2000. After sequencing, 12 SNPs were found in only one twin and the child,
but seven were eliminated (Weber-Lehmann et al., 2014). The remaining five SNPs were
confirmed by Sanger sequencing. The five SNPs were located on chromosomes 4, 6, 11, 14,
and 15. Four of these SNPs in the blood samples were also found in the buccal mucosa
samples (Weber-Lehmann et al., 2014). These SNPs were used to determine the paternity of
the child.
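
The logic of looking for SNPs found "in only one twin and the child" can be sketched with
set operations. This is a simplified illustration rather than the pipeline used by
Weber-Lehmann et al. (2014): the variant calls and the function name are hypothetical,
and real data would also need filtering for sequencing errors and confirmation by Sanger
sequencing, as described above.

```python
def twin_specific_inherited(twin1, twin2, child, mother):
    """Return variants the child shares with exactly one twin and not with the mother."""
    # Variants in the child that cannot be explained by the mother.
    paternal_candidates = child - mother
    return {
        "twin1": (twin1 - twin2) & paternal_candidates,
        "twin2": (twin2 - twin1) & paternal_candidates,
    }


# Hypothetical variant calls as (chromosome, position, alternate base).
shared = {("chr1", 1000, "A"), ("chr2", 2000, "G")}
twin1 = shared | {("chr4", 4400, "T"), ("chr6", 6600, "C")}
twin2 = shared | {("chr9", 9900, "G")}
mother = {("chr2", 2000, "G"), ("chr7", 7700, "A")}
child = {
    ("chr1", 1000, "A"),
    ("chr2", 2000, "G"),
    ("chr4", 4400, "T"),
    ("chr6", 6600, "C"),
    ("chr7", 7700, "A"),
}

result = twin_specific_inherited(twin1, twin2, child, mother)
print(result)
```

In this example only the first twin carries private variants that also appear in the
child, which would point to that twin as the father.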

Another feature of the nuclear genome that has been researched for discriminating
monozygotic twins is rapidly mutating Y-STRs. Only one study was found that investigated
whether rapidly mutating Y-STRs can be used to differentiate monozygotic twins. Samples
were collected from 14 pairs of Caucasian-Lebanese male monozygotic twins. For
genotyping, the Identifiler Plus and 27-plex Yfiler Plus kits were used. The Y-STRs
included in the kit were DYS19, DYS385a/b, DYS389I/II, DYS390, DYS391, DYS392, DYS393,
DYS437, DYS438, DYS439, DYS448, DYS456, DYS458, DYS635, GATA H4, DYS460, DYS481, DYS533,
DYS387a/b, DYS449, DYS518, DYS570, DYS576, and DYS627 (Romanos & Borjac, 2018). The loci
DYS387a/b, DYS449, DYS518, DYS570, DYS576, and DYS627 were the rapidly mutating Y-STRs
(Romanos & Borjac, 2018). The mutation rate of all of the rapidly mutating Y-STRs was
over 1 × 10⁻² (Romanos & Borjac, 2018). The samples were extracted, quantified,
amplified, and genotyped on the ABI 3500 Genetic Analyzer to confirm monozygosity
(Romanos & Borjac, 2018). The samples were re-amplified with the Yfiler Plus kit and
genotyped. After genotyping, it was found that the Yfiler Plus kit was unable to
differentiate any of the pairs of monozygotic twins. Even though the results were not
significant, a larger sample size and the use of NGS could aid in discovering differences
in rapidly mutating Y-STRs between monozygotic twins.


RNA profiling: microRNAs 


RNA profiling with miRNAs has also been used to discriminate monozygotic twins. miRNAs
are small noncoding RNAs that contribute to posttranscriptional gene regulation by
binding to mRNAs (O’Brien et al., 2018). When miRNAs bind to mRNAs, they destabilize the
mRNA and prevent the protein from being translated. As a result, miRNAs affect the level
of expression of a specific protein without changing the gene sequence (O’Brien et al.,
2018). miRNAs are small, stable, and easy to detect, all of which are beneficial for
forensic science. To determine whether miRNAs are useful biomarkers for discriminating
monozygotic twins, blood samples were collected from four pairs of monozygotic twins
(Fang et al., 2019). The RNA was then isolated, quantified, and sequenced on an Illumina
HiSeq 2500. Once the miRNA profiles were generated, 142 to 176 miRNAs were found in each
of the samples (Fang et al., 2019). There was no significant difference in which miRNAs
were present between the twins of each pair (Fang et al., 2019). As a result, the
expression of the miRNAs was assessed for differences. In the first twin pair, 14 miRNAs
were expressed at different levels (Fang et al., 2019). The second pair had 26
differently expressed miRNAs, the third pair had 10, and the fourth pair had 41. Only one
miRNA, miR-451a, showed a significantly different level of expression in all of the twin
pairs (Fang et al., 2019). Future research is required to determine whether that miRNA is
useful for the discrimination of monozygotic twins.
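
Finding a miRNA that differs in every twin pair amounts to intersecting the per-pair
lists of differently expressed miRNAs. The sketch below shows that idea under stated
assumptions: the expression values are invented, and the log2 fold-change cutoff and
pseudocount are hypothetical stand-ins for whatever statistical test a study applies;
they are not the criteria used by Fang et al. (2019).

```python
import math


def differential_mirnas(pair, min_abs_log2fc=1.0, pseudocount=1.0):
    """Return miRNAs whose expression differs between the two twins of one pair.

    pair maps a miRNA name to (expression in twin 1, expression in twin 2).
    The cutoff and pseudocount are illustrative assumptions.
    """
    hits = set()
    for name, (expr1, expr2) in pair.items():
        log2fc = math.log2((expr1 + pseudocount) / (expr2 + pseudocount))
        if abs(log2fc) >= min_abs_log2fc:
            hits.add(name)
    return hits


# Hypothetical normalized read counts for two twin pairs.
pairs = [
    {"miR-451a": (400, 100), "miR-16-5p": (210, 200), "miR-21-5p": (50, 180)},
    {"miR-451a": (90, 300), "miR-16-5p": (150, 160), "miR-21-5p": (120, 110)},
]

per_pair = [differential_mirnas(pair) for pair in pairs]
shared_across_pairs = set.intersection(*per_pair)
print(per_pair)
print(shared_across_pairs)
```

Note that the direction of the difference can vary between pairs, as it does for miR-451a
in this toy data; the intersection only records that a difference exists in every pair.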


High-resolution melting to determine DNA methylation 


Because differences in the nuclear and mitochondrial genomes are so rare and difficult to
find, multiple researchers have turned to investigating the epigenome. The epigenome
involves changes in DNA that occur because of environmental events. One layer of the
epigenome is DNA methylation. DNA methylation is the result of DNA methyltransferase
adding a methyl group to the C5 position of cytosine. Methylation regulates gene
expression and can be found in multiple locations, including CpG islands in the promoter
region. To detect DNA methylation, the extracted DNA is treated with bisulfite. Bisulfite
treatment converts unmethylated cytosine into uracil, which is then replaced by thymine
during PCR. Methylated cytosine does not undergo bisulfite conversion and remains
cytosine. There are two methods to detect the change from cytosine to thymine. One method
is post-PCR high-resolution melt (PCR HRM). Researchers have observed differences in
methylation patterns between monozygotic twins in Alu-E2F3 and Alu-SP (Stewart et al.,
2015). These regions were bisulfite converted, and the melting temperatures were compared
between the buccal swabs from five sets of twins. The more cytosine that is retained,
that is, the higher the methylation, the higher the melting temperature, because a
cytosine-guanine pair is held together by three Watson-Crick hydrogen bonds whereas a
thymine-adenine pair is held together by two. If the melting temperatures differ, it is
concluded that the monozygotic twins have different methylation patterns in Alu-E2F3 or
Alu-SP. After comparing the melting temperatures, the study found that all five sets of
twins had different methylation patterns in Alu-E2F3, and four sets of twins could be
differentiated with Alu-SP (Stewart et al., 2015). This method is less time-consuming and
less costly than sequencing methods (Stewart et al., 2015). Although it is more
cost-efficient, it has a lower resolution and requires a larger amount of DNA to achieve
adequate results due to its lower multiplexing capacity.
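
The link between methylation, bisulfite conversion, and melting behavior can be
illustrated in silico. The sketch below converts every unmethylated cytosine on one
strand to thymine and uses the remaining GC fraction as a rough proxy for melting
temperature. The amplicon sequence, the methylated positions, and the function names are
hypothetical, and GC fraction is only an approximation of what an HRM instrument
measures.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulfite treatment followed by PCR on a single strand.

    Unmethylated C is read as T after PCR; methylated C stays C.
    methylated_positions holds 0-based indices of methylated cytosines.
    """
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


def gc_fraction(seq):
    """Fraction of G and C bases, used here as a crude proxy for melting temperature."""
    return (seq.count("G") + seq.count("C")) / len(seq)


# Hypothetical amplicon with cytosines at indices 1, 4, 8, and 12.
amplicon = "ACGTCGTACGTTCGA"
twin1_converted = bisulfite_convert(amplicon, methylated_positions={1, 4, 8})
twin2_converted = bisulfite_convert(amplicon, methylated_positions={1})

print(twin1_converted, round(gc_fraction(twin1_converted), 3))
print(twin2_converted, round(gc_fraction(twin2_converted), 3))
```

The more heavily methylated template keeps more cytosines after conversion, so its
amplicon has a higher GC fraction and would be expected to melt at a higher temperature.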


Detecting DNA methylation with pyrosequencing 


Another method for differentiating monozygotic twins is to locate differentially
methylated regions (DMRs) with pyrosequencing. Pyrosequencing is useful for sequencing
methylation sites, or CpG sites, in specific regions of the genome in monozygotic twins.
One study investigated the methylation in LINE-1 (long interspersed element 1) (Xu et
al., 2015). LINE-1 is a repetitive DNA retrotransposon (Ardeljan et al., 2017). This
repetitive element makes up 17% of the human genome (Xu et al., 2015). Since it is a
large portion of the genome, the methylation level of LINE-1 is used to infer the global
DNA methylation within an individual (Xu et al., 2015). In addition, LINE-1 is made up of
repeat sequences that have low selective pressure and greater variation, making the locus
a good candidate for differentiating monozygotic twins (Xu et al., 2015). To determine
whether the LINE-1 methylation level differs between monozygotic twins, buccal and blood
samples from 119 pairs of monozygotic twins and 57 dizygotic twins were collected (Xu et
al., 2015). After sample collection and extraction, the samples were bisulfite-treated,
amplified, and pyrosequenced at the promoter region of LINE-1 (Xu et al., 2015). The
methylation level of the promoter region of LINE-1 was compared in the monozygotic twins,
and t-tests were performed to determine if there was a statistically significant
difference in the methylation level (Xu et al., 2015). The scientists observed that 15
out of 119 pairs of monozygotic twins were differentiated, and 10 out of 57 dizygotic
twins were differentiated (Xu et al., 2015). A correlation with age was found only in the
methylation level of the buccal samples. In addition, the sample type and sex of the
individual were found to impact the methylation level (Xu et al., 2015). The methylation
level of the buccal samples was significantly lower than that of the blood samples (Xu et
al., 2015). Additionally, females had lower mean methylation values in blood samples but
showed no difference in buccal samples (Xu et al., 2015).


Genome-wide DNA methylation profiling 


Another approach to differentiating monozygotic twins is genome-wide DNA methylation
profiling with NGS or with Illumina’s BeadChip arrays. The Illumina Human Methylation
450k BeadChip array interrogates over 450,000 CpG sites, and the Illumina Infinium Human
Methylation27 BeadChip kit interrogates over 27,000 CpG sites (Li et al., 2013). Multiple
studies have assessed the methylation pattern in monozygotic twins and found the most
promising CpG sites for distinguishing monozygotic twins. One study analyzed blood
samples from 22 pairs of monozygotic twins using the Illumina Infinium Human
Methylation27 BeadChip assay to interrogate thousands of CpG sites (Li et al., 2013).
After analysis, 92 CpG sites showed significant differences in methylation between the
monozygotic twins (Li et al., 2013). These CpG sites were spread across 20 different
chromosomes and were found to be related to the function of peripheral blood leukocytes
(Li et al., 2013). These 92 CpG sites allowed all 22 pairs of monozygotic twins to be
distinguished (Li et al., 2013).

Another study also screened the blood samples of monozygotic twins to determine the main
DMRs. Four pairs of monozygotic twins were screened and sequenced. After sequencing, 1772
to 3766 differentially methylated regions were identified in each pair of monozygotic
twins (Du et al., 2015). Criteria were set to narrow these down to the regions best
suited for differentiating monozygotic twins. The DMRs had to be present in all four sets
of monozygotic twins and reach a defined level of methylation difference (Du et al.,
2015). The methylation difference was calculated by mapping the reads, scoring the
methylation, normalizing all of the scores, calculating the MeDIP scores, and then
calculating the fold change between the MeDIP scores of the monozygotic twins (Du et al.,
2015). To meet the criteria for a DMR, the fold change had to be greater than 830, with
the MeDIP score in one twin being zero while the other was larger than 8.31 (Du et al.,
2015). After applying these criteria, 38 DMRs were selected (Du et al., 2015). All of the
DMRs that fit the criteria were in CpG islands. The DMR-associated genes were found to be
involved in cell growth, differentiation, proliferation, and development (Du et al.,
2015). One gene was a zinc finger protein-related gene. Zinc finger proteins are
important in early development, so it was hypothesized that the DMRs associated with zinc
finger proteins are useful for discriminating younger twins (Du et al., 2015). This study
shows that all four pairs of monozygotic twins could be differentiated with 38 DMRs, but
the sample size was small, so further investigation of the 38 DMRs is needed to confirm
that they are capable of differentiating all types of monozygotic twins
(Du et al., 2015).
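
The selection criteria described above can be sketched as a simple filter. The example
below keeps a region only if, in every twin pair, one twin has a MeDIP score of zero and
the other has a score above 8.31. The region names, scores, and function names are
hypothetical, and the fold-change threshold of 830 is not recomputed here because the
text does not state how a zero score is handled in the ratio.

```python
def passes_dmr_criteria(score_a, score_b, min_score=8.31):
    """One twin must score zero while the other scores above min_score."""
    low, high = sorted((score_a, score_b))
    return low == 0 and high > min_score


def shared_dmrs(pairs, min_score=8.31):
    """Return regions that satisfy the criteria in every twin pair.

    pairs is a list of dicts mapping a region name to the (twin 1, twin 2)
    MeDIP scores for that pair.
    """
    common_regions = set.intersection(*(set(pair) for pair in pairs))
    return {
        region
        for region in common_regions
        if all(passes_dmr_criteria(*pair[region], min_score=min_score) for pair in pairs)
    }


# Hypothetical MeDIP scores for two twin pairs.
pairs = [
    {"region_1": (0.0, 12.4), "region_2": (0.0, 5.0), "region_3": (9.7, 0.0)},
    {"region_1": (10.1, 0.0), "region_2": (0.0, 9.9), "region_3": (2.0, 9.0)},
]

selected = shared_dmrs(pairs)
print(selected)
```

Requiring a region to pass in every pair is what reduces thousands of per-pair
differences to a short list of candidate DMRs.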

In addition to the above studies, buccal swabs, saliva, and used cigarette butts have
been tested by genome-wide DNA methylation analysis. In one study, blood, buccal swab,
cigarette butt, and saliva samples were provided by one pair of 52.5-year-old European
female monozygotic twins (Vidaki et al., 2018). The buccal cell samples were processed,
extracted, quantified, and used to generate a genome-wide DNA methylation profile with
the Illumina Infinium Human Methylation 450k BeadChip array. Each site was given a beta
value, where 0 indicated a completely unmethylated CpG site and 1 indicated a completely
methylated site (Vidaki et al., 2018). The data were normalized, and DMRs were eliminated
if the probes mapped to multiple locations and if the marker had a beta value higher than
0.85 or lower than 0.15 (Vidaki et al., 2018). In addition, the difference between the
beta values had to be greater than 0.4. After considering these conditions, 129 candidate
DMRs were found in buccal cells (Vidaki et al., 2018). For assay development, 22 of the
top DMRs (over 0.4 difference) were selected (Vidaki et al., 2018). These sites were
sequenced to determine methylation in saliva and cigarette butts. The saliva and
cigarette butts had lower or opposite methylation differences in some of the 22 DMRs when
compared to the buccal cell DNA. Therefore, the twins could not be differentiated with
those sample types (Vidaki et al., 2018). Even though a difference in methylation between
the twins could not be obtained with the saliva and cigarette butt samples, a difference
was still obtained from the DNA extracted from the buccal swabs (Vidaki et al., 2018).
This study expanded on previous studies, which only found DMRs in one sample type.
Different cell types perform different functions. Therefore, different genes will be
methylated because different genes need to be downregulated or turned off. Buccal swabs,
cigarette butts, and saliva are commonly found in casework. Therefore, finding DMRs in
these sample types would be useful so that case evidence can be investigated no matter
the sample type.
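
The beta-value filtering steps can be expressed as a small function. The sketch below is
one possible reading of the conditions summarized above, and it makes two assumptions
that the text does not confirm: each exclusion rule is applied independently, and the
0.15 to 0.85 range is checked against the beta value of each twin. The probe names,
beta values, and function name are hypothetical.

```python
def candidate_dmrs(betas_twin1, betas_twin2, multi_mapped=frozenset(),
                   low=0.15, high=0.85, min_delta=0.4):
    """Return {probe: absolute beta difference} for probes that pass every filter."""
    candidates = {}
    for probe in betas_twin1.keys() & betas_twin2.keys():
        beta1, beta2 = betas_twin1[probe], betas_twin2[probe]
        if probe in multi_mapped:
            continue  # probe maps to more than one genomic location
        if not (low <= beta1 <= high and low <= beta2 <= high):
            continue  # assumed rule: drop near-fully (un)methylated sites
        delta = abs(beta1 - beta2)
        if delta > min_delta:
            candidates[probe] = delta
    return candidates


# Hypothetical beta values (0 = unmethylated, 1 = fully methylated).
twin1 = {"probe_1": 0.80, "probe_2": 0.95, "probe_3": 0.50, "probe_4": 0.70}
twin2 = {"probe_1": 0.30, "probe_2": 0.40, "probe_3": 0.45, "probe_4": 0.20}

hits = candidate_dmrs(twin1, twin2, multi_mapped={"probe_4"})
print(sorted(hits))
```

Only probe_1 survives in this example: probe_2 is excluded by the beta range, probe_3 by
the small difference, and probe_4 because it is listed as multi-mapped.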

In another study, blood samples were collected from 12 pairs of monozygotic twins (Park
et al., 2017). After analysis with the Illumina Human Methylation 450k array, over
450,000 CpG sites were narrowed down to the six CpG sites with the largest differences in
methylation among the 12 pairs of twins (Park et al., 2017). The six CpG sites were
cg00211609 (body of FAM132A), cg26287080 (body of EXOC7), cg01558909 (promoter of HBM),
cg21036194 (body of SNCAIP), cg01419577 (intergenic region), and cg04620228 (body of
UTP14A) (Park et al., 2017). These six CpG sites were shown to have a high frequency of
differences in methylation status, but further research was needed on these six sites to
determine whether additional twin pairs also show a high frequency of methylation
differences.

Using the sites identified, new pyrosequencing assays were developed. In the Towson Human
Remains Identification Laboratory (THRIL), we have performed pyrosequencing with
cg00211609 (body of FAM132A) using 20 monozygotic twin sets. The twin-set samples were
obtained from the Coriell Institute for Medical Research, extracted using the
phenol-chloroform method, and quantitated with the Qiagen Quantiplex Pro RGQ Kit using a
Qiagen RotorGene Q real-time PCR instrument. After quantitation, the samples were
bisulfite-treated with the Qiagen EpiTect Plus Bisulfite Conversion Kit. The samples were
then amplified with the Qiagen PyroMark PCR kit and sequenced on a Qiagen Q48
pyrosequencing instrument with the Qiagen GeneGlobe assay, which includes the primer
cg00211609. Out of the first 13 monozygotic twin pairs that were successfully sequenced,
several pairs exhibited a statistically significant difference in percent methylation
when using a paired t-test. Even though some twin pairs could not be differentiated with
this one primer, other primers and CpG sites may display differences in methylation. We
are continuing to examine additional loci, including cg26287080 (body of EXOC7) and
cg01558909 (promoter of HBM), with more sets of monozygotic twins.


Conclusion 


The advancements in NGS technologies have allowed for higher throughput, so that large
portions of a suspect’s genome can be sequenced. Sanger sequencing, NGS, pyrosequencing,
and PCR HRM can be used to differentiate monozygotic twins. With these technologies,
differences in the mitochondrial genome, nuclear genome, miRNAs, and DNA methylation
patterns can be detected. These differences can be used to differentiate monozygotic
twins, which cannot be differentiated with STR typing because monozygotic twins have
identical STR profiles. Multiple studies have reported differences between monozygotic
twins, with the various studies focusing on different portions of the nuclear or
mitochondrial genome. However, with the exception of high-resolution melt, the methods
discussed are expensive to perform and rely on costly instrumentation that is not widely
available in forensic laboratories, which can be an issue for their implementation in
forensic criminal casework. Tissue variation remains a challenge because the differences
between monozygotic twins vary across sample types. Therefore, further research is
required to identify loci that can be used to differentiate the most pairs of monozygotic
twins, decrease the cost, and standardize and validate the methods so that one or more
can be used in criminal cases and be accepted in forensic practice (Fig. 10.2).

Figure 10.2 Methods for detecting DNA methylation.


Introduction


An individual’s genetic uniqueness is determined by the combination of genetic markers
inherited from his or her parents. A genetic marker is a gene or a short segment of DNA
that has a distinctive location on a specific chromosome and can be used for the
identification of a species or an organism. It can be as short as a single nucleotide
polymorphism or a longer sequence, such as a minisatellite or microsatellite. Genes on a
chromosome exist in alternate forms known as alleles that differ in sequence. The goal of
genetic marker typing is to determine which alleles are present at genetically variable
loci (Jordan & Mills, 2021; Khan et al., 2017, 2020). There are three types of DNA
markers: autosomal, paternal, and maternal. Autosomal markers are unique DNA sequences
scattered across the genome that are used in forensics, population genetics, pedigree
analysis, missing persons’ identification, and genealogy. Recently, the interest of
scientists has shifted toward the use of maternal or mitochondrial DNA (mtDNA) markers
for genetic analyses such as identifying missing persons, linking suspects to crime
scenes, and all sorts of evolutionary and genealogical analyses. The use of mtDNA for the
identification of an organism, especially from ancient remains, has several advantages
over autosomal DNA. First, mtDNA is present at a high copy number in each cell, whereas
autosomal DNA has only two copies per diploid cell (Cole, 2016; Hofreiter et al., 2021;
Parson et al., 2013). Second, mtDNA is maternally inherited and follows a non-Mendelian
inheritance pattern, thus lacking recombination, which makes it easier to trace unique
sequences through the generations (Zhang et al., 2018). Moreover, in ancient remains the
nuclear DNA is often found degraded, while the mtDNA is more likely to survive for longer
periods, providing enough DNA to be sequenced and compared. These advantages make mtDNA a
powerful tool for evolutionary and genealogical studies, and it has been extensively
applied to ancient samples, tying ancient remains to living descendants (Chyleński et
al., 2019; Cui et al., 2013; Hofreiter et al., 2021; Modi et al., 2017; Willerslev &
Cooper, 2005).

Due to the practical and technical hurdles involved in sequencing the entire
mitochondrial genome using the Sanger sequencing method, the analysis of mtDNA has been
limited to short segments (approximately 700 bases), i.e., hypervariable regions I and II
(HVI and HVII). Most of the technical difficulties have been addressed by the
introduction of new high-throughput technology, including next-generation sequencing
(NGS) methods. The first of these NGS approaches was reported in 2005 and implemented in
ancient DNA (aDNA) research (Margulies et al., 2005). These methods have enabled the
sequencing of millions of fragments at once, reducing the cost and complexity associated
with the sequencing of ancient mtDNA. The use of such high-throughput methods for whole
mitochondrial genome analysis increased the percentage of unique haplotypes from 65% to
between 90% and 100%. In addition, unlike conventional sequencing techniques, NGS allows
for the detection of misincorporation patterns, which aids in distinguishing between
endogenous aDNA and present-day contaminants (Kirkinen et al., 2022; Senovska et al.,
2021).

This chapter discusses the key role NGS has played in unraveling historical secrets and
evolutionary patterns by generating large amounts of data based on maternal DNA
signatures. The chapter also discusses the procedures adopted to prepare aDNA samples for
NGS, mtDNA capture from highly degraded samples, and the bioinformatics tools used for
the analysis. The chapter also sheds light on the comparative advantage of using NGS over
traditional techniques for mtDNA profiling from ancient samples.


What stories does ancient DNA analysis tell us?


aDNA analysis helps address a number of important questions that were previously
impossible to answer. The majority of aDNA research is focused on the origin,
relationships, genetic diversity, and migration of human populations. aDNA analysis is
used to classify human remains into a specific group or a well-defined population
(Carlyle et al., 2000; Keyser-Tracqui et al., 2003; Kolman, 1999, pp. 183-200; Krings et
al., 2000; Lalueza et al., 1997). The aDNA retrieved from the remains of hominins and
Neanderthals has been analyzed to clarify their relationships with present-day humans
(Gibbons, 2021; Hawks, 2021; Vernot et al., 2021).

aDNA analysis has enabled scientists to gain insight into the reasons behind extinction
by analyzing extinct species. The DNA from the ancient remains of different extinct
creatures has been extracted and analyzed to investigate genetic variations. The results
are compared to data from genomic libraries containing information on the genomic
signatures of different species from the past (if studied) and present. Attempts are then
made to find the possible cause behind the widespread extinction of these species
(Brüniche-Olsen et al., 2018; Orlando et al., 2002; Woods et al., 2021).

The analysis of aDNA has been used to identify the sex of ancient human skeletons using
genetic markers on the X and Y chromosomes. In most cases, the remains found are
degraded, so the sex cannot be determined morphologically. The data obtained from such
studies can be used to shed light on marriage patterns, burial customs, and differential
rates of mortality (Nistelberger et al., 2019; Wasef et al., 2020).

The identification of kin relationships in group burials is another application of aDNA
analysis. When a group of human skeletons is discovered, it is of interest to find out
whether the individuals belonged to the same family unit. For example, a study of 62
specimens from a 2000 BP necropolis was conducted to reconstruct the kinship of the
buried individuals. DNA was extracted from the skeletal remains and amplified for
single-copy autosomal microsatellite markers. The results revealed close relationships,
including a few parent/child relations (Bruck, 2021; Drosou et al., 2018; Gad et al.,
2021; Keyser-Tracqui et al., 2003).
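
As a rough illustration of how autosomal microsatellite genotypes can support or exclude a parent/child relation, the sketch below applies the basic Mendelian rule that, barring mutation and typing error, a parent and child share at least one allele at every locus. It is a minimal sketch and not the method of the cited studies: the function names, the mismatch tolerance, and the genotypes are hypothetical, and real kinship work on aDNA uses likelihood-based statistics that account for allele frequencies and allelic dropout.

```python
from typing import Dict, Tuple

# Genotype: locus name -> the two alleles (repeat counts) typed at that locus.
Genotype = Dict[str, Tuple[int, int]]


def shares_allele(a: Tuple[int, int], b: Tuple[int, int]) -> bool:
    """Return True if the two genotypes have at least one allele in common."""
    return bool(set(a) & set(b))


def parent_child_compatible(g1: Genotype, g2: Genotype, max_mismatches: int = 0) -> bool:
    """Check whether two individuals could be parent and child.

    Only loci typed in both individuals are compared. `max_mismatches` is a
    hypothetical tolerance for loci affected by mutation or allelic dropout.
    """
    shared_loci = set(g1) & set(g2)
    if not shared_loci:
        raise ValueError("no locus was typed in both individuals")
    mismatches = sum(
        1 for locus in shared_loci if not shares_allele(g1[locus], g2[locus])
    )
    return mismatches <= max_mismatches


# Hypothetical genotypes, for illustration only.
adult = {"D3S1358": (15, 17), "vWA": (16, 18), "FGA": (21, 24)}
child = {"D3S1358": (17, 18), "vWA": (16, 16), "FGA": (22, 24)}
unrelated = {"D3S1358": (14, 16), "vWA": (17, 19), "FGA": (20, 23)}

print(parent_child_compatible(adult, child))
print(parent_child_compatible(adult, unrelated))
```

Compatibility under this rule does not prove a relationship; it only means the pair cannot be excluded as parent and child at the loci typed.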

The genetic analysis of ancient fecal samples (coprolites) can be used to investigate
dietary patterns and to find out which plants and animals an individual consumed.
Moreover, coprolites can serve as an alternative source of DNA for identifying
individuals, as in some cases they provide a better source of aDNA than skeletal or
mummified tissues (Hofreiter et al., 2000; Poinar et al., 1998).

Nonhuman remains are also subjected to aDNA analysis to investigate certain aspects 
of human prehistory and to reconstruct and enhance our understanding of the domesti- 
cation process and agriculture. The DNA from animal remains can be extracted using the 
same techniques used for human DNA (Barnes et al., 2002; Jaenicke-Després et al., 2003; 
Leonard et al., 2002). aDNA analysis has shown that present-day dogs have interbred
with dogs brought by European colonists (Leonard et al., 2002). Similarly,
analysis of an ancient maize sample from Mexico showed that alleles in each of
the three important genes that are predominant in maize nowadays were already present 
4000 years ago (Jaenicke-Després et al., 2003). The analysis of ancient floral and faunal 
remains helps us to understand the ecosystem in which prehistoric humans lived; further- 
more, the analysis of the molecular diversity of animal and plant remains also gives clues 
about the hunting and dietary habits of humans. The analysis of ancient animal remains
has also been used to explain the movements of prehistoric populations (Matisoo-Smith
et al., 1997).

aDNA analysis addresses another essential question related to the evolution of hu- 
man pathogens and the pattern of prehistoric diseases (Gad et al., 2021). aDNA 
studies have been performed to identify the agents of human diseases such as anthrax,
bubonic plague, tuberculosis (TB), and influenza (Basler et al., 2001; Donoghue
et al., 1998; Drancourt & Raoult, 2002). The spread of TB, caused by Mycobacterium
tuberculosis, was traced back to European colonization; however, the aDNA analysis 
identified TB in a 1400-year-old Byzantine fragment of calcified lung tissue, which 
confirmed the presence of the pathogen before European contact (Donoghue et al., 
1998, 2004). 

Challenges faced in studying ancient DNA


Investigations based on aDNA have allowed scientists to understand the major historical 
events that shaped the fate of modern humans. However, the road to this analysis was full 
of challenges. Despite huge advancements in molecular biology techniques, some of 
these challenges are still faced by researchers. 


Contamination 


Amplification of contaminant DNA is a problem in all studies involving polymerase 
chain reaction (PCR); however, this problem is even more severe when dealing with 
ancient human DNA, which is often found in very low amounts. The ancient and his- 
toric biological samples discovered at excavation sites yield a very low amount of DNA; 
moreover, this DNA is often contaminated and modified in different ways, which makes 
it difficult to amplify the authentic DNA extracted from these samples. Contamination
by even a single molecule of modern DNA can cause errors in the analysis (Furtwangler
et al., 2018; Grigorenko et al., 2009). Even with the use of highly specific and rigorous
procedures (Hofreiter et al., 2001; Poinar & Cooper, 2000), modern human DNA
contamination remains prevalent in the amplification of aDNA extracted from ancient
remains (Hofreiter, n.d.; Krings et al., 2000). In ancient bone and tooth samples, it is
almost impossible to completely purify the endogenous DNA from modern DNA 
contamination. Decontamination procedures such as ultraviolet C (UVC) and bleach 
treatments are also not efficient enough to fully eradicate the contamination (Gilbert 
et al., 2005). This may be due to pores in the bone and tooth dentine
that serve as the main entry sites for DNA, especially from sweat, skin fragments, and
exhaled cells. Well-preserved teeth are less prone to contamination than bone
fragments. Hairs also seem to be less prone to
contamination and are considered one of the most reliable sources for ancient human 
DNA analysis, despite their less common presence in ancient samples as compared to 
skeletal remains (Gilbert et al., 2004). 

Different studies have reported the presence of modern human DNA contamination 
in ancient samples. For example, a study on Neolithic and Upper Palaeolithic samples 
discovered in Spain sequenced the mtDNA hypervariable region 1 (HVR1) and 
compared it with the DNA obtained from six individuals working on these samples. 
The comparison of their sequences showed that 17.13% of the 572 cloned sequences 
from aDNA were actually contaminants from the individuals working on the project 
(Caramelli et al., 2008; Sampietro et al., 2005). 
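
The kind of comparison described above can be sketched as follows: each cloned sequence is checked against the sequences of the people who handled the samples, and the share of matching clones is reported. This is a minimal sketch under stated assumptions. The exact-match rule, the function name, and the toy sequences are hypothetical, and the count of 98 clones is back-calculated from the reported 17.13% of 572 rather than taken from the source.

```python
from typing import Iterable, List


def contaminant_fraction(clone_seqs: List[str], worker_seqs: Iterable[str]) -> float:
    """Fraction of cloned sequences identical to any worker's sequence.

    Exact string matching is a simplification; a real analysis would compare
    diagnostic HVR1 positions and tolerate damage-induced differences.
    """
    if not clone_seqs:
        raise ValueError("no cloned sequences supplied")
    known = {seq.upper() for seq in worker_seqs}
    hits = sum(1 for seq in clone_seqs if seq.upper() in known)
    return hits / len(clone_seqs)


# Hypothetical toy sequences standing in for HVR1 haplotypes.
WORKER_HAPLOTYPE = "ACCTGA"
ANCIENT_HAPLOTYPE = "ACTTGA"

# 98 matching clones out of 572 reproduces the reported 17.13% (assumed count).
clones = [WORKER_HAPLOTYPE] * 98 + [ANCIENT_HAPLOTYPE] * 474
fraction = contaminant_fraction(clones, [WORKER_HAPLOTYPE])
print(f"{fraction:.2%}")
```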


Chemical modifications and degradation of ancient DNA 

The alterations and mutations in postmortem DNA are also one of the main problems in 
complex aDNA analysis. The amplification of aDNA becomes very difficult compared to 
DNA extracted from a fresh sample because of the alterations and degradations of aDNA. 
One of the most obvious types of DNA damage of aDNA is its degradation to small
fragments. This reduction in the size of DNA fragments is caused by enzymatic actions 
that occur right after the death and also nonenzymatic hydrolytic cleavage of phospho- 
diester bonds creating single-stranded nicks (Geigl & Grange, 2018; Lindahl, 1993; 
Shapiro, 1981, pp. 3-18). The hydrolytic cleavage of glycosidic bonds between nitrog- 
enous bases and the sugar backbone results in abasic sites, which then undergo chemical 
rearrangement leading to strand breakage (Mohni et al., 2019; Lindahl & Nyberg, 1972; 
Lindahl & Karlstrom, 1973; Shapiro, 1981, pp. 3-18). The extent of DNA degradation 
depends on the sample age, the geographic origin of the sample, and the environmental
conditions in which the sample was stored (Grigorenko et al., 2009; Renaud et al.,
2019).

Unlike metabolically active cells, postmortem cells that are in dormant conditions do 
not have an active DNA repair system and are prone to hydrolytic or oxidative DNA 
modifications and strand damage. The most critical modifications are those that
alter the bases without halting amplification, so that the altered nucleotides are
carried into the amplification products. These modifications include type I
substitutions (A > G/T > C) and type II substitutions (C > T/G > A). Different
methods are used in aDNA analysis to minimize the effect of postmortem DNA
modification, which is a prerequisite for improving the overall quality of DNA
templates. These methods rely on two important reagents: the enzyme
uracil-N-glycosylase, which eliminates the deamination products of cytosine (Gilbert
et al., 2003; Hofreiter et al., 2001), and N-phenacylthiazolium bromide, which breaks
intermolecular cross-links (Mohni et al., 2019). Moreover, the use of high-fidelity
DNA polymerases such as Pfu and Taq HiFi reduces the rate of sequence errors,
thus increasing the efficacy of aDNA amplification (Cooper et al., 2001).
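
To make the two substitution classes concrete, the sketch below counts mismatches between a reference and an aligned read and sorts them into type I (A > G, T > C), type II (C > T, G > A), and all other changes. It assumes the notation reads as reference base > observed base and that the two sequences are already aligned and of equal length; the function name and the example sequences are hypothetical.

```python
from typing import Dict

# Substitution classes as (reference base, observed base) pairs.
TYPE_I = {("A", "G"), ("T", "C")}
TYPE_II = {("C", "T"), ("G", "A")}


def classify_substitutions(reference: str, read: str) -> Dict[str, int]:
    """Count type I, type II, and other mismatches in an aligned read."""
    if len(reference) != len(read):
        raise ValueError("reference and read must be aligned to equal length")
    counts = {"type I": 0, "type II": 0, "other": 0}
    for ref_base, read_base in zip(reference.upper(), read.upper()):
        # Skip matches, ambiguous bases, and alignment gaps.
        if ref_base == read_base or "N" in (ref_base, read_base) or "-" in (ref_base, read_base):
            continue
        if (ref_base, read_base) in TYPE_I:
            counts["type I"] += 1
        elif (ref_base, read_base) in TYPE_II:
            counts["type II"] += 1
        else:
            counts["other"] += 1
    return counts


# Hypothetical aligned pair: A>G, C>T, G>A, and T>A mismatches.
counts = classify_substitutions("ACGTACGT", "GTGTACAA")
print(counts)
```

An excess of type II changes, which is the signature of cytosine deamination, is one of the patterns that damage-assessment tools look for when judging whether reads are authentically ancient.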


Next-generation sequencing for maternal genome analysis in ancient 
samples 


NGS has provided solutions to several limitations that had prevented scientists from
recovering the information contained in ancient human samples. One of the key
limitations was contamination from several sources, which challenged the authenticity
of the results. In 2006, two studies performed on the same Neanderthal sample produced
inconsistent results because of sample contamination (Green et al., 2006; Noonan
et al., 2006). As a consequence, a set of criteria was established to reduce the risk
of sample contamination. A key step in minimizing contamination is the cleanliness of
the NGS facility (Green et al., 2006, 2009; Wall & Kim, 2007). Experiments showed that
samples prepared for NGS in a clean environment had minimal contamination with
modern human DNA, whereas samples prepared in a nonclean laboratory contained modern
human DNA as a contaminant (Green et al., 2009). Even when ancient samples are
contaminated with modern human DNA, newer NGS tools help in analyzing aDNA by
identifying and excluding the contamination. aDNA consists of very short fragments.
In comparison to PCR, NGS allows those short molecules to be detected and sequenced
and avoids the amplification of larger genomic segments (Krause et al., 2010).

Ancient samples are sensitive, and special care is required when preparing them for
NGS. Contamination from other sources, such as human, bacterial, or fungal DNA,
often compromises the results of aDNA experiments (Knapp et al., 2012;
Miller et al., 2009). Therefore, great care is taken when preparing samples from ancient
sources to obtain reliable results.


Facility for ancient sample processing 


The spatial separation of the aDNA facility from the post-PCR processing laboratory is
essential to avoid cross-contamination. Further, movement between the two should be
unidirectional, from the aDNA facility to the post-PCR laboratory, and laboratory
workers must make it a practice to visit the aDNA facility before visiting the post-PCR
room. Returning to the aDNA facility from a post-PCR laboratory on the same day
increases the risk of contamination. Therefore, a facility should be designed so that
the aDNA facility cannot be entered from a post-PCR laboratory (Farrer et al., 2021;
Knapp et al., 2012; Pinhasi et al., 2019). Further, such facilities often have a
limited-access policy, under which the aDNA facility is open only to trained individuals
who are aware of the protocols necessary for reducing the contamination risk.

The spatial isolation of the aDNA facility from the post-PCR laboratory must be
accompanied by separate spaces within the aDNA facility for the different steps of aDNA
processing. Because the steps for processing ancient samples vary with the nature of the
sample, it is acceptable to have at least three separate rooms or hoods for specific
experimental activities (Fulton, 2012). The first is for changing personal
protective equipment, consumables storage, UVC irradiation-based decontamination
of samples and consumables, and preprocessing of museum samples that may be
contaminated. The second room is for the processing (cutting or grinding) of bones. This
room should ideally have an internal UVC-equipped fume hood. The third room is for
performing DNA extraction and DNA amplification experiments (Templeton et al., 2013).

Protocols for reducing contamination are particularly necessary when preparing aDNA
samples for NGS. Even a small amount of nontarget DNA can interfere with the sequence
reads of the sample DNA. Thus, reducing the cross-contamination risk using hairnets,
gloves, facemasks/shields, HEPA-filtered air conditioning, positive-pressure systems,
fume hoods, and laminar flow hoods is a prerequisite for obtaining clean NGS
sequence data.

Sample preparation and DNA extraction from ancient samples


Samples for NGS are prepared in an ultraclean environment to avoid modern DNA
contamination, and great care is taken in handling ancient samples. Unlike fresh
specimens, ancient samples lack cells or tissues, a possible reason for the low
concentrations of endogenous DNA available. Ancient samples are also often in
various states of degradation or are severely damaged. Therefore, the DNA extraction
method must not be too harsh, and the use of chemical detergents and high
temperatures must be avoided. Although such procedures release large amounts of DNA,
they can inflict more damage on the aDNA, leading to a poor overall yield. Lastly,
ancient samples contain numerous PCR inhibitors that may interfere with DNA
amplification. The method for DNA extraction from ancient bone or tooth samples must
therefore be efficient enough to deal with these problems. Two methods used for the
initial processing of ancient samples are the Loreille and Dabney protocols. These
protocols have been shown to work best for ancient samples and burned samples,
respectively. However, modifying these protocols according to the amount of powdered
sample used and the composition of the lysis buffer has been shown to make them more
effective for high-throughput mtDNA analysis of ancient samples (Xavier et al., 2021).

These methods to obtain DNA from ancient samples are destructive for the sample 
itself and majorly rely on drilling, powdering, or cutting. On the obtained powdered 
samples, methods such as isopropanol or ethanol precipitation, isolation through mem- 
branes, or silica method can be employed for DNA molecule extraction (Damgaard et al., 
2015; Rohland & Hofreiter, 2007). The silica method is to date the preferred method
due to the comparatively high yield of DNA (Gamba et al., 2014; Rohland et al., 2010, 
2018). This is a 2-day method that begins with the removal of surface dirt that may 
contain several sorts of inhibitors. Such inhibitors can interfere with the extraction pro- 
cess or disrupt the enzymatic manipulation of the extracted DNA at later stages. The 
outer surface of the ancient sample is carefully removed using a single-use grinding 
tool. This step is critical as it can enhance the endogenous-to-exogenous DNA ratio. 
In the next step, a compact part of the bone is cut off, or a tooth root or dentine is 
removed. As such procedures are irreversibly damaging, curators usually do not allow 
the removal of parts belonging to unique specimens. In such cases, drilling techniques
are employed, and fine powder is obtained without major damage to the sample.
Drilling is performed at low speed to avoid heat generation that could possibly deteriorate 
aDNA (Dabney et al., 2013; Rohland & Hofreiter, 2007). In a few labs, tooth samples are 
cavitated and ultrasonically cleaned prior to drilling, while the UV-irradiation method is
employed in other laboratories to decontaminate samples (Harney et al., 2021; Matsvay 
et al., 2019). The finely ground powder is obtained by mortar and pestle crushing, a ball 
mill, or a freezer mill (Matsvay et al., 2019; Rohland & Hofreiter, 2007). The finer the
powder, the higher the DNA yield. The powdered sample is then mixed with the extraction
solution and incubated overnight with gentle shaking in the dark (Rohland &
Hofreiter, 2007). One study suggested that predigestion of powdered samples with 
EDTA (0.5M), proteinase K, and N-laurylsarcosyl solution improves the yield by 
~2.5-fold (Damgaard et al., 2015). The next day, the sample is mixed with binding buffer
and silica suspension, and the pH is maintained between 3.5 and 4.0. A pH below this
range may damage DNA. The solution is incubated for 3 h with agitation to allow DNA
to bind to the silica. The choice of binding buffer and purification matrix greatly
affects the yield and the DNA fragments that will be eluted in later stages (Rohland et al.,
2018). Once a DNA-silica suspension is obtained, binding buffer is removed through 
washing, and DNA is eluted through DNA elution buffer (Rohland & Hofreiter, 2007). 

Recent investigations have reported methods that minimize the destruction of ancient
samples during processing. One study demonstrated the use of targeted bleaching at the
ancient tooth root. Most studies have used the tooth root or dentin for extraction.
Harney and colleagues instead extracted DNA from the cementum of the tooth. Cementum
is a bone-like layer in the root of the tooth that is enriched with cementocytes. These
cells are preserved after death under the mineral structures of the tooth and are a good
source of DNA. Harney and colleagues exposed ancient tooth roots to lysis solution for a
minimal duration, which caused less destruction to the sample, and the DNA yield was
equivalent to that of destructive processing methods for ancient samples (Harney et al.,
2021). Despite the availability of numerous extraction protocols, it is wise to choose an
extraction method based on sample type, degree of degradation, and sample age.

Commercially available kits, which work on more or less the same principles, are also
used for extracting DNA from ancient samples. Recently, the PrepFiler BTA forensic DNA
extraction kit was used to extract aDNA from human tooth samples. This kit uses
magnetic beads for DNA extraction. It is designed for extracting degraded DNA from
forensic evidence and has also proved applicable to ancient samples (Matsvay et al.,
2019). The concentration of extracted aDNA is usually determined with a Qubit dsDNA HS
Assay Kit on a Qubit 2.0 fluorimeter (Matsvay et al., 2019; Neparaczki et al., 2017).

A special modular system was introduced in 2018 to process ancient samples for NGS.
The samples consisted of teeth from a skull discovered during the excavation of a
seventeenth-century burial site in Radonezh, Russia. Both sample preparation and DNA
extraction were performed in a four-glove-box modular system. Antechambers connected
the boxes and had an evacuation system that provided a high-purity atmosphere for the
preparation of samples. Molecular filters in the air purification system prevented
the entrance of molecules larger than nitrogen, maintained the sterility of the boxes,
and allowed samples to be processed in the absence of oxygen and organic molecules. Each
step of the extraction was conducted in a separate chamber. This system was beneficial
because the boxes were connected, which avoided the risk of contamination during sample
transfer from one lab to another (Matsvay et al., 2019).


Library preparation for next-generation sequencing 


aDNA is often degraded, and its fragments have blunt or overhanging ends. Additionally,
aDNA can contain deaminated cytosines, which appear as uracil bases. aDNA is
repaired for these shortcomings using two different methods. In the first
method, uracil bases are removed from the sequence, and the flanking 5' and 3' ends are
repaired to produce ligatable blunt ends. The reaction is performed by adding three
enzymes: T4 DNA Polymerase (Thermo Fisher Scientific Inc.), T4 Polynucleotide Kinase
(T4 PNK, Thermo Fisher Scientific Inc.), and T4 DNA Ligase (Thermo Fisher Scientific 
Inc.). The first two enzymes repair the overhanging ends through their 3’-5’ exonuclease 
activity and phosphorylation ability, respectively. The last enzyme ligates the blunt ends 
to synthetic adapters. Other methods only remove flanking regions and do not excise 
uracil (Briggs & Heyn, 2012; Matsvay et al., 2019). Repaired aDNA is then purified using 
spin columns such as the MinElute spin column, QIAGEN, or carboxyl-coated magnetic 
beads (Sera-Mag SpeedBeads) (Briggs & Heyn, 2012; Neparaczki et al., 2017). In the 
next step, adapter sequences are attached to the repaired sequences, and the entire library 
is subjected to amplification through PCR. The amplification procedure makes use of 
the universal priming sites that are introduced through adapters. The final concentration 
for the amplified product should be 50 ng/L, and libraries below this concentration
must be reamplified (Neparaczki et al., 2017). The use of Phusion, a proofreading
polymerase, is recommended during the second amplification to avoid PCR errors (Briggs & Heyn, 2012).


Mitochondrial DNA capturing and sequencing 


mtDNA-capturing biotinylated probes are prepared from long-range, overlapping 
amplified products. These probes are designed using NimbleGen, a proprietary software 
developed by Calloway and his team (Calloway et al., 2015). This software employs the 
information from a human mitochondrial genome population database, global mtDB, 
and targets the consensus sequences in mtDNA. A tiling approach is applied in probe
design to obtain overlapping target sequences with high redundancy. The probes are
50-100 bp in length and can effectively target most of the mtDNA genome (Shih
et al., 2018). The Roche NimbleGen SeqCap EZ Developer Library kit is generally
used during hybridization experiments. Samples for hybridization consist of probes 
(100 ng) and library DNA (400 ng). The thermal profile generally followed for hybrid- 
ization is DNA denaturation for 5 min at 95°C and overnight (14-18 h) incubation to 
allow probes to bind to complementary sequences on library DNA. Blocking
oligonucleotides are also included in the mix to prevent nonspecific hybridization and promote
strand displacement (Neparaczki et al., 2017; Shih et al., 2018; Templeton et al., 
2013). The number of blocking oligonucleotides varies from two to four. The sequences
generally used are as follows:

BO1.P5.part1F: AATGATACGGCGACCACCGAGATCTACAC-Phosphate
BO2.P5.part2F: ACACTCTTTCCCTACACGACGCTCTTCCGATCT-Phosphate
BO4.P7.part1R: GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT-Phosphate
BO6.P7.part2R: CAAGCAGAAGACGGCATACGAGAT-Phosphate

Dynabeads MyOne Streptavidin C1 magnetic beads (Thermo Fisher) are used to capture
the biotinylated probe-DNA complexes, and the library DNA is released by treating the
captured complexes with the strand-displacing Bst DNA polymerase (large fragment, New
England Biolabs). The hybridized products are amplified with KAPA HiFi HotStart
ReadyMix (KAPA Biosystems) for 10-15 PCR cycles. The amplified product is again
purified using a MinElute column or magnetic beads, quantified, and subjected to
sequencing. Sequencing is performed either on the MiSeq sequencer with the MiSeq
Reagent Kit v3 (Illumina, MS-102-3003) (Shih et al., 2018; Templeton et al., 2013) or on
the Ion Torrent Personal Genome Machine (PGM; Life Technologies) with the Ion PGM 200
sequencing kit v2 chemistry (Life Technologies) (Templeton et al., 2013).
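
The tiling approach mentioned in this section can be illustrated with a short sketch that cuts overlapping probes from a circular genome and reports how many probes cover each position. This is a minimal sketch and not the proprietary design software referred to above: the function names, the 80 bp probe length (chosen from within the 50-100 bp range given in the text), the 20 bp step, and the toy genome are all assumptions made for illustration.

```python
from typing import List, Tuple


def tile_probes(genome: str, probe_len: int = 80, step: int = 20) -> List[Tuple[int, str]]:
    """Return (start, sequence) pairs for overlapping probes on a circular genome."""
    n = len(genome)
    if not (0 < step <= probe_len <= n):
        raise ValueError("need 0 < step <= probe_len <= genome length")
    # mtDNA is circular, so the start of the sequence is appended to the end
    # to let probes span the arbitrary origin of the linear representation.
    extended = genome.upper() + genome[: probe_len - 1].upper()
    return [(start, extended[start:start + probe_len]) for start in range(0, n, step)]


def tiling_depth(probes: List[Tuple[int, str]], genome_len: int) -> List[int]:
    """Count how many probes cover each position of the circular genome."""
    depth = [0] * genome_len
    for start, seq in probes:
        for offset in range(len(seq)):
            depth[(start + offset) % genome_len] += 1
    return depth


# Hypothetical 200 bp toy genome standing in for a mitochondrial reference.
toy_genome = "ACGTTGCAAG" * 20
probes = tile_probes(toy_genome, probe_len=80, step=20)
depth = tiling_depth(probes, len(toy_genome))
print(len(probes), min(depth), max(depth))
```

With these settings every position is covered by probe_len / step probes, which is the redundancy that tiling is meant to provide; a smaller step gives more redundancy at the cost of more probes.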


Next-generation sequencing data analysis 


Software used for studying mtDNA data obtained by NGS includes NextGENe v.2.4.2
(SoftGenetics LLC., State College, PA, USA) and GeneMarkerHTS v.1.2.2. The resulting
reads are mapped to the genome assembly using Burrows-Wheeler Aligner v0.7.9.
Initially, the GRCh37.75 human genome reference assembly was used to access the revised
Cambridge Reference Sequence (rCRS, NC_012920.1) for mtDNA (Mudge et al., 2021;
Neparaczki et al., 2017). Recent studies, however, prefer the Reconstructed Sapiens
Reference Sequence for mapping mtDNA haplotype data (Behar et al., 2012; Sarmento,
n.d.). Additional tools include Picard Tools v1.113 for removing duplicate reads,
SAMtools v1.1 for sorting and indexing data (Etherington et al., 2015), mapDamage2.0 for
assessing aDNA damage patterns (Jonsson et al., 2013), and FreeBayes v1.02 for
identifying genetic variants (Garrison & Marth, 2012).
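
One way the tools listed above might be chained is sketched below as a function that only builds the command lines, without running them. This is an illustrative sketch, not the workflow of the cited studies. The file names are hypothetical; the choice of `bwa aln`/`bwa samse`, the haploid setting `-p 1` for FreeBayes, and the `picard` wrapper command are assumptions; and option syntax differs between tool versions (for example, older SAMtools releases use a different `sort` syntax), so each command should be checked against the documentation of the installed version.

```python
from typing import List


def build_mtdna_pipeline(reference: str, fastq: str, prefix: str) -> List[str]:
    """Return shell command strings for a simple mtDNA mapping workflow.

    Nothing is executed here; the commands are only assembled for inspection.
    """
    sai = f"{prefix}.sai"
    sam = f"{prefix}.sam"
    sorted_bam = f"{prefix}.sorted.bam"
    dedup_bam = f"{prefix}.dedup.bam"
    return [
        # Index the reference (for example, a FASTA file of the rCRS).
        f"bwa index {reference}",
        # Align single-end reads and convert the alignment to SAM.
        f"bwa aln {reference} {fastq} > {sai}",
        f"bwa samse {reference} {sai} {fastq} > {sam}",
        # Sort by coordinate.
        f"samtools sort -o {sorted_bam} {sam}",
        # Remove duplicate reads.
        f"picard MarkDuplicates I={sorted_bam} O={dedup_bam} "
        f"M={prefix}.dup_metrics.txt REMOVE_DUPLICATES=true",
        # Index the deduplicated alignment.
        f"samtools index {dedup_bam}",
        # Assess aDNA damage patterns.
        f"mapDamage -i {dedup_bam} -r {reference}",
        # Call variants, treating mtDNA as haploid (an assumption).
        f"freebayes -f {reference} -p 1 {dedup_bam} > {prefix}.vcf",
    ]


# Hypothetical file names, for illustration only.
commands = build_mtdna_pipeline("rCRS.fasta", "sample.fastq", "sample")
for command in commands:
    print(command)
```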


Next-generation sequencing based on ancient sample mtDNA in 
evolutionary studies 


DNA samples from ancient remains were largely studied using PCR and Sanger 
sequencing before the invention of high-throughput NGS. Sanger sequencing was 
once thought to be the gold standard for DNA sequencing; however, it had a low 
throughput and was hence expensive, especially for large-scale sequencing investigations. 
Furthermore, library preparation and amplification, as well as colony selection, are
time-consuming and inefficient processes (Ruzzi et al., 2012). These limitations of
traditional sequencing are even more critical in the case of aDNA, and the emergence of NGS has
opened up entirely new possibilities and expanded the area of applications (Millar et al., 
2008). The new generation of NGS sequencers has transformed aDNA research by 
increasing the number of bases sequenced per run while decreasing sequencing costs. 
The most widely used NGS technologies in the field of aDNA analysis are the 454/ 
Roche FLX and the Illumina Genome Analyzer (Margulies et al., 2005; Millar et al., 
2008; Shendure et al., 2004). A significant amount of research has been done utilizing 
the applications of NGS in analyzing mtDNA retrieved from ancient samples. 


Mitochondrial DNA from coat hair of mammoth 


Hair shafts are a promising source of aDNA. Several properties of shafts suggest that they 
constitute an attractive DNA source for NGS. They are mostly present in abundance, 
which makes them preferable to bones, as the destructive sampling can cause the loss 
of important morphological information. Moreover, the turnover of keratinocytes in 
the hair bulb is significantly high; therefore, the mitochondrial levels in these cells may 
be comparatively higher than those in the other cells used for the analysis of aDNA 
(Vanscott et al., 1963). mtDNA was extracted from the coat-hair shafts of a Siberian 
woolly mammoth (Mammuthus primigenius) found in northern Siberia's permafrost
deposits. The use of hair shafts in combination with sequencing by synthesis resulted
in 10 full mitochondrial genome sequences with 7.3- to 48.0-fold coverage. The yield of
mtDNA sequence was about 26 times higher than from the previously reported
permafrost-preserved bone (Poinar et al., 2006). Even though one of the specimens
was > 50,000 14C years old and another had been kept at ambient temperature
for 200 years, the observed rate of damage-derived sequencing errors was lower than
that found in previously published frozen bone samples. This method therefore laid
the groundwork for molecular-genetic examination of museum collections (Gilbert
et al., 2007).
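
Fold-coverage figures such as those quoted above are commonly understood as the total number of mapped bases divided by the length of the genome. The sketch below shows that arithmetic under this assumption; the function name, read lengths, and genome length are hypothetical and are not taken from the mammoth study.

```python
from typing import Iterable


def mean_fold_coverage(mapped_read_lengths: Iterable[int], genome_length: int) -> float:
    """Mean coverage = total mapped bases / genome length."""
    if genome_length <= 0:
        raise ValueError("genome length must be positive")
    return sum(mapped_read_lengths) / genome_length


# Hypothetical example: 2,000 mapped reads of 80 bp on a 16,000 bp genome.
read_lengths = [80] * 2000
coverage = mean_fold_coverage(read_lengths, 16_000)
print(f"{coverage:.1f}-fold")
```

The mean hides local variation, so per-position depth is usually inspected as well before a consensus sequence is called.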

Subsequently, five new full mitochondrial genomes of Siberian woolly mammoths were
sequenced from hair shafts with up to 73-fold coverage, using 454 sequencing coupled
with conventional PCR-based approaches. Three of these five genomes represent the
first-ever clade II mammoth genomes. The complete analysis of these and previously
reported mitochondrial genomes demonstrated that there are two mitochondrial
clades (clade I and clade II), both of which are highly diverged. Furthermore, statistical 
examination of the combined datasets revealed that clade II disappeared from Siberia 
long before clade I (Gilbert et al., 2008). 


Mitochondrial DNA of Paleo-Eskimo


The Independence I/Saqqaq and Pre-Dorset cultures, which covered roughly
3900-2500 14C years before present (BP), and the Independence II/Dorset cultures,
which spanned around 2500-700 14C years BP, are known collectively
as the Paleo-Eskimos (Eskimo, 1984). They represent the earliest human expansion into
the northernmost reaches of the New World. Modern and ancient Neo-Eskimos have been
classified using analysis of hypervariable segment 1 (HVS1) of mtDNA. A
3400-4500-year-old frozen hair sample was retrieved from an early Saqqaq
settlement of Paleo-Eskimo humans and sequenced using NGS. The sequencing data
revealed that the sample differs from modern Native Americans and Neo-Eskimos and
belongs to haplogroup D2a1, similar to the one previously found in modern Aleuts
and Siberian Sireniki Yuits. These findings show that the migrations that brought early
people to the far north of the New World did not come from Native Americans or
the groups who eventually gave rise to the Neo-Eskimo expansion from Alaska around
1000 years ago (Gilbert et al., 2008).


The genetic history of Neanderthals (the divergence between extant humans 
and Neanderthals) 

Neanderthals were a type of hominid that lived in Europe and Western Asia until
approximately 30,000 years ago. They are believed to be more closely related to modern
humans than chimpanzees are, which is why studying their genomes provides a unique
opportunity to discover genetic modifications that are specific to anatomically fully
developed modern humans. The first molecular genetic analysis of Neanderthals was
performed in 1997 on the Neandertal type specimen, which was discovered in the Neander
Valley, near Dusseldorf, Germany, in 1856. A 379-base-pair-long section of HVRI of the
mitochondrial genome was determined (Krings et al., 1997).

High-throughput 454 sequencing techniques have recently been used to study the 
mtDNA of Neanderthals. The main advantage of using 454 sequencing technologies 
is the huge volume of sequence data that is generated by these technologies, making 
aDNA sequencing studies feasible. These NGS technologies also help in increasing
our understanding of the modifications and degradation of DNA that occur during
deposition in burial contexts. 

One of the studies reported an entire mitochondrial genome from the bone of a 
38,000-year-old Neandertal individual using 454 sequencing. The high-throughput 
sequencing identified 8341 mtDNA sequences among 4.8 Gb of sequence data obtained
from 0.2 g of bone. The assembled sequence conclusively indicates that Neandertal
mtDNA is distinct from modern human mtDNA and that the Neandertal mtDNA lineage
diverged from the current human mtDNA lineage around 660,000 years ago. This
means that the most recent common ancestor of Neandertal and human mtDNA lived
more than twice as long ago as the most recent common ancestor of current human
mtDNAs (Green et al., 2008).


mtDNA analysis of a Hominin from South Siberia 


Other than Neanderthals, whose DNA sequences have already been determined in a 
large number of individuals, the genetic linkages of other hominin lineages are mostly 
unknown. Despite recent developments in the extraction of aDNA from Neanderthals 
and early modern humans (Gilbert et al., 2007; Krause et al., 2010; Krings et al., 
1997), DNA sequences have not been recovered from other Pleistocene hominins
such as Homo erectus, H. heidelbergensis, or H. antecessor. One possible explanation
is that in these older samples, the DNA was exposed to harsh environmental
conditions for longer periods of time. DNA degradation accelerates under conditions
such as increased temperature and soil acidity (Smith et al., 2003).

A complete mtDNA sequence was retrieved from a bone unearthed in 2008 in Denisova
Cave in southern Siberia using the Illumina platform. The mtDNA analysis
showed that it represents an unknown kind of hominin mtDNA that shared a common
ancestor with modern human and Neanderthal mtDNAs around one million years
ago. This suggests that it comes from a different hominin migration out of Africa 
than the forebears of Neanderthals and modern humans. The stratigraphic analysis of 
the cave where the bone was discovered implies that the Denisova hominin lived in 
close proximity to both Neanderthals and modern humans in both time and space 
(Krause et al., 2010). 


mtDNA analysis of ancient mummies 


The people of ancient Egypt believed in an afterlife. They preserved the integrity of their 
bodies through artificial mummification because in their belief the integrity of their phys- 
ical bodies was important to continue their existence in the hereafter (Ikram, 2003). The 
mtDNA analysis using NGS technologies has been found to be instrumental in revealing 
the mysteries of these Egyptian mummies. 

The first NGS-based mtDNA analysis of mummified samples, dated from 806 BCE to
124 CE, was performed to trace the ancestral lineage of ancient Egyptians. The study
characterized five Egyptian mummies and two excavated Bolivian lowland skeletons. In
addition to human DNA, the sequencing libraries also included DNA from two parasites
(Plasmodium and Toxoplasma), indicating the presence of the respective infections in
the mummies (Khairat et al., 2013).

Another study obtained 90 mitochondrial genomes from Egyptian mummies, together with
genome-wide datasets from three individuals. The findings revealed that ancient Egyptians
were more closely related to Near Easterners than modern-day Egyptians are, the latter
having received additional Sub-Saharan admixture in more recent times (Schuenemann et al.,
2017).

In the last decade or so, entire mitochondrial genomes of mummies have been
sequenced and their haplogroups identified (Gómez-Carballa et al., 2015).
One of the oldest complete mitochondrial genomes sequenced from mummified remains
belonged to a prehistoric European, the Tyrolean Iceman, who lived around 5350–5100
years ago. The mtDNA obtained from the corpse was amplified and sequenced. Comparison
of the mtDNA with related extant lineages revealed that the Iceman belonged to the
mitochondrial haplogroup K1 (Ermini et al., 2008).

Furthermore, the complete mitochondrial genome of a 7-year-old victim of an Inca child
ritual that took place around 500 years ago has been sequenced. The mummy of the child
was discovered in the Argentinian province of Mendoza. The analysis of mtDNA showed that
the mummy belonged to the C1bi mitochondrial haplogroup, which matches haplogroups
currently found in Peru and Bolivia (Gómez-Carballa et al., 2015).

Overall, it appears that NGS technologies have sped up the mtDNA analysis of 
ancient mummies and paved the way for future research in this field. However, more 
extensive research is required in order to validate the findings obtained through archeol- 
ogy, literature, and anatomy. 


Recent developments 


The field of aDNA has been revolutionized by the rapid advancement of next-generation
sequencing technologies, which allow the amplification of very small aDNA fragments and
yield a large amount of sequence data at a low cost. These NGS techniques have made it
possible to detect nucleotide misincorporations and have helped in the detection of modern
exogenous DNA contamination (Emery et al., 2022; Kirkinen et al., 2022). Such developments
have enhanced the authenticity of aDNA analysis results.

The major challenge researchers faced in the analysis of NGS-produced data was its
management. The data produced by high-throughput NGS technologies in aDNA analysis
run to gigabytes, and advances in bioinformatics and statistical tools were needed to
make the analysis of such large datasets easier. Earlier bioinformatics tools such as
mapDamage (Ginolhac et al., 2011) and mapDamage2.0 (Jónsson et al., 2013) helped in
handling and identifying degraded sequences. More recently, a computational tool,
DamageProfiler, was developed that allows the detection of misincorporations in aDNA
(Neukamm et al., 2021). PMDtools has also been used to describe damage patterns by
identifying damaged sequences, which made decontamination of the aDNA possible
(Skoglund et al., 2014). Recent advances in algorithmic tools have also allowed
researchers to estimate exogenous contamination in human mtDNA analysis of ancient
samples. ContamLD (Nakatsuka et al., 2020) and AuthentiCT (Peyrégne & Peter, 2020),
coupled with the previously designed tool schmutzi (Renaud et al., 2015), have enabled
scientists to reconstruct endogenous consensus sequences more accurately. Haplocheck
is another computational tool that employs mitochondrial phylogeny to differentiate
between contaminating and endogenous DNA (Weissensteiner et al., 2021).
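
The idea of a misincorporation profile can be illustrated with a short sketch. It is not the implementation of mapDamage, DamageProfiler, or any other tool named above; it only assumes the widely reported background that cytosine deamination makes C-to-T mismatches more frequent near the 5' ends of authentic aDNA reads. The function name, the gap-free alignment format, and the toy reads are all hypothetical.

```python
from collections import defaultdict


def c_to_t_by_position(alignments, max_pos=5):
    """C->T mismatch frequency at each of the first max_pos positions (5' end).

    alignments: iterable of (reference, read) strings of equal length,
    assumed gap-free (a simplification made for this sketch).
    """
    ref_c = defaultdict(int)   # times the reference shows C at position i
    c_to_t = defaultdict(int)  # of those, times the read shows T
    for ref, read in alignments:
        for i, (r, q) in enumerate(zip(ref[:max_pos], read[:max_pos])):
            if r == "C":
                ref_c[i] += 1
                if q == "T":
                    c_to_t[i] += 1
    return [c_to_t[i] / ref_c[i] if ref_c[i] else 0.0 for i in range(max_pos)]


# Toy alignments invented for illustration only.
toy = [
    ("CACGT", "TACGT"),
    ("CCGTA", "TCGTA"),
    ("CAGCT", "CAGCT"),
    ("GCATC", "GCATC"),
]
profile = c_to_t_by_position(toy)
print(profile)
```

In this toy set the C-to-T frequency is elevated only at the first position, which is the kind of pattern a damage profile is meant to expose; modern contaminant reads would be expected to show a flat profile.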

Other important methodological contributions in aDNA analysis include the development
of precise algorithms for the taxonomic assignment of aDNA data (Cribdon et al., 2020),
the reconstruction of methylation signatures from DNA damage in ancient plant remains
(Wagner et al., 2020), and the characterization of ancient genomes and methylomes using
improved reference alignment of DNA obtained from ancient samples (Poullet & Orlando,
2020). Moreover, a laboratory information management system known as CASCADE has been
developed for archiving and storing experimental data, including information related to
preparation, extraction protocols, library construction, amplification, and sequencing
of aDNA (Dolle et al., 2020). Recently, scientists have used a multiple-taxon approach
to integrate genomic data from multiple taxa and so provide a clearer understanding of
the interaction of early humans with animals and the environment (Dussex et al., 2021).

The present techniques available for aDNA research have significantly increased the 
accessibility of aDNA and opened doors for novel research. However, more research is 
needed to develop and optimize protocols for every step of aDNA analysis, especially in 
the area of hybridization enrichment. Some of the applications require available sequence 
data to become more precise and accurate, for example, mapping of NGS reads to a refer- 
ence sequence; therefore, more and more sequence information needs to be collected 
from ancient samples. The computational tools also need to be improved to analyze 
and compare the huge amount of sequence data retrieved from ancient samples. Given
these technological advances, the direction of aDNA research
is going to change radically in the coming years.


Future perspective 


Maternal genome analysis has proven to be an excellent resource in forensic investigations
and evolutionary studies where the sample is highly degraded and other genomic markers,
such as autosomal or paternal DNA markers, fail to generate readable data. Analysis of
aDNA has provided insight into events of the past, such as migrations, historically
significant events, and human evolution. Advances in sequencing technology and the use of
high-throughput techniques such as NGS have revolutionized the field: large volumes of
data can now be obtained and analyzed with much greater ease. Despite such developments,
there is still room for improvement. Methods of ancient sample processing must be further
improved to reduce permanent damage to unique samples. Further, genomic DNA extraction
must be optimized to enhance the yield of mtDNA. The amounts of reagents used must be
studied, and their role in noise production and experiment failure in NGS data analysis
must be determined. Probe design for mtDNA capture can be optimized for the mtDNA control
region, and mtDNA STRs can also be targeted with capture probes.

Ongoing research continues to improve the understanding of mtDNA haplotypes and
haplogroups. More knowledge in the field has great potential for building a comprehensive
understanding of human history. A database of aDNA data obtained through maternal genome
analysis using NGS may have applications in revealing inter- and intrapopulation
similarities and differences and in linking that knowledge to historical migrations. Such
information can also facilitate the delineation of population-specific mtDNA signatures
that might have a role in forensic casework or in the design of population-specific
medication for diseases.


Introduction
Parentage and kinship testing through history 


Throughout history, the question of biological kinship has been of crucial significance for
human societies. From the mythical origins of Greek gods and heroes, some of whom, like
Heracles or Theseus, had a mortal mother and two putative fathers (one mortal and one
divine), to the sacred texts of many religions, examples of questionable parentage and
kinship are abundant.

Over the years, various methods have been developed to resolve disputed parentage.
In probably the most famous maternity dispute, King Solomon employed a psychological
test of maternal love to identify the true mother of a child. As the story goes (1 Kings
3:16–28), two women came before the king, each claiming to be the mother of a newborn
child. The King ordered that the living child be cut in two, with one half given to one
woman and the other half to the other. Solomon’s thinking was that the true mother of the
baby would rather sacrifice her child’s custody than her child’s life. This is exactly
what happened, and the King awarded custody of the child to the woman who, upon hearing
the King’s verdict, cried: “Please, my lord, give her the living baby! Don’t kill him!”

Before the advent of modern serological and DNA-based methods in the 20th century,
parentage disputes were settled based on morphological, anthropological, and medical
evidence. The exclusion of paternity was based on evidence of presumed impotence or
ejaculation dysfunction of a potential father. The length of the gestation period, abnormal
morphology of reproductive organs, various anthropological tests, and even the presence
of certain characteristics in the child thought to be inherited from the alleged father (AF)
were used as irrefutable evidence to prove or disprove questionable paternity.

The discovery of spermatozoa by Johan Ham and Antonie van Leeuwenhoek of Delft
in 1677 (Albrecht & Schultheiss, 2004) and of the human egg by Karl Ernst von Baer in 1827,
followed by the observation by Oscar and Richard Hertwig in 1875–78 of the fusion of
the nuclei of the egg and the spermatozoon, explained the nature of fertilization. These
advances in embryology, together with the discovery in the 1860s of the laws of heredity
by Gregor Mendel, laid a firm scientific foundation for modern parentage and kinship
analysis.

The twentieth century saw the introduction of scientific methods that created 
contemporary genetic analysis. The first major step in the development of parentage 
and kinship testing based on the principles of heredity was the discovery of the blood 
groups by Karl Landsteiner in 1901 (Landsteiner, 1901). This was followed by the
fundamental work by Von Dungern and Hirschfeld (1910) on the Mendelian inheritance of the
ABO blood groups, leading to their application to paternity suits by Reuben Ottenberg
(Ottenberg, 1921) and marking the beginning of the era of forensic genetics. The MN 
and the Rh systems were added in 1927 and 1940, respectively. These three systems 
combined could exclude a wrongly accused male 50% of the time (Shaw, 1983). 
Now, decisions as to the true parentage of a child could be based on the results of mo- 
lecular marker analysis considered within the framework of the laws of genetics. By 1974, 
there were 57 genetic systems (25 immunological and 32 biochemical) that could be used 
for parentage analysis (Chakraborty et al., 1974). For the first time, it became possible to 
achieve a probability of exclusion (PE) of 99% (Shaw, 1983) and calculate probabilities of 
paternity in nonexclusion cases. 

The application of DNA analysis to paternity and kinship cases can be traced to the
discovery of restriction enzymes and restriction fragment length polymorphism in the late
1970s, but the real breakthrough came with the invention of DNA fingerprinting by
Sir Alec Jeffreys in the mid-1980s (Gill et al., 1985) and its first application to
resolve an immigration paternity dispute (Jeffreys et al., 1985). The “Eureka moment” for
the application of DNA analysis to human identity testing was the recognition by Sir Alec
that repeating sequences, which are abundant in the human genome, are highly variable and
informative genetic markers that can provide accurate individualizing information. Among
the various types of repeat sequences found in the human genome, microsatellites, also
known as short tandem repeats (STRs), were established in the early 1990s as the main
marker type for parentage and kinship analysis.

At approximately the same time, DNA typing was augmented by the recently developed
in vitro DNA amplification technology called the polymerase chain reaction (PCR)
(Mullis et al., 1986). Combining STR markers with PCR amplification allowed the analysis
of extremely small amounts of DNA (often as little as 10–15 pg) as well as DNA of
various degrees of degradation. DNA genotyping no longer needed to rely on the availability
of large amounts of good-quality DNA, usually obtained from fresh blood. Blood
and saliva stains, various tissue samples, buccal swabs, and even hair roots became routine
sources of DNA for testing.

Currently, parentage and kinship cases are routinely resolved by assessing 11–25
STR loci. According to the latest (2013) available report by the Association for the
Advancement of Blood and Biotherapies (formerly the American Association of Blood Banks,
AABB), more than 99.9% of all paternity and kinship tests in the USA are performed using
STR markers.


Principles of parentage and kinship analysis


Parentage analysis is based on the simple presumption that, in the absence of mutation, a
child must receive one allele matching each parent at every locus tested (Tracey, 2001).
This allele is called an “obligate allele”. When, for all the molecular markers tested, the
child (C) and the alleged parent (AP) share matching obligate alleles, parentage
cannot be excluded, and the weight of evidence in favor of the presumed parentage is
calculated, taking into account the allelic frequencies of the markers analyzed. This
weight of evidence is known as the combined parentage index (CPI). When a genetic
inconsistency (a mismatch between the genotypes of AP and C) is observed, it may be
an indication of parentage exclusion; however, the possibility of the mismatch being
caused by mutation must always be considered (Gjertson et al., 2007). If a CPI value
greater than 10,000 is reached, the genetic evidence is thought to be in favor of
parentage, and the mismatched alleles (if any) are presumed to be mutations (Li et al.,
2017). Otherwise, analysis of additional markers is required. Mismatches of one or two
alleles are commonly observed in parentage testing. When mutation events are taken into
account, the CPI in cases with two suspected mutations is very low, bordering on single
digits. When three inconsistencies have been detected, the CPI is usually close to zero,
and this is deemed sufficient to issue a paternity exclusion report (Gjertson et al.,
2007). At the same time, nonexclusion cases with three mutations have been described
(Jacewicz et al., 2004), and if such a situation is suspected, a major effort to find and
test additional markers is required in order not to come to a false-negative conclusion.
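
As a minimal sketch of how a CPI is assembled from per-locus values, the following code computes a paternity index for the simplest case: a child and an alleged father only, with the mother untyped, Hardy-Weinberg proportions assumed, and mutation ignored. Those simplifications, the function names, and the allele frequencies are assumptions made for illustration; casework formulas also cover trios and include a mutation model.

```python
from math import prod


def paternity_index(child, alleged_father, freq):
    """Motherless (duo) paternity index at one locus, ignoring mutation.

    child, alleged_father: 2-tuples of alleles; freq: allele -> population frequency.
    """
    a, b = child
    # Numerator: the alleged father passes on either of his alleles with
    # probability 0.5; the other child allele comes from an untyped mother
    # and is weighted by its population frequency.
    num = 0.0
    for t in alleged_father:
        if a == b:
            num += 0.5 * (freq[a] if t == a else 0.0)
        else:
            num += 0.5 * ((freq[b] if t == a else 0.0) + (freq[a] if t == b else 0.0))
    # Denominator: Hardy-Weinberg probability of the child's genotype
    # when the father is a random man from the population.
    den = freq[a] ** 2 if a == b else 2 * freq[a] * freq[b]
    return num / den


def combined_parentage_index(loci):
    """Product of per-locus indices; loci are assumed to be independent."""
    return prod(paternity_index(c, af, f) for c, af, f in loci)


# Hypothetical genotypes and allele frequencies, for illustration only.
loci = [
    ((12, 14), (14, 16), {12: 0.2, 14: 0.1, 16: 0.3}),
    ((9, 9), (9, 9), {9: 0.25}),
]
cpi = combined_parentage_index(loci)
print(cpi, cpi > 10_000)
```

Because mutation is ignored here, a single mismatching locus drives the product to zero, which is exactly the behavior the text warns against; a casework implementation would replace that zero with a small mutation-based term.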

Kinship testing, other than parentage, is based on a different presumption. Here, no 
obligate alleles are considered, but instead, a probability that alleles identical by state (IBS) 
between the individuals tested are also identical by descent (IBD) (i.e., originate from a 
common relative(s) of these individuals) as opposed to being identical by chance (i.e., 
inherited from unrelated ancestors) is evaluated. The statistical evidence in kinship testing 
is less strong than in parentage analysis. It is theoretically impossible to exclude kinship 
even when inconsistencies are observed at each locus tested. Whereas in most cases, 
parentage exclusion can be stated with more or less certainty, unlikely kinship is stated 
as having a low probability, and likely kinship is stated as having a high probability 
(Wenk, 2004). Even though modern STR multiplexes contain 25 loci or more, it is 
sometimes problematic to determine kinship between first-degree relatives (full siblings) 
and even more difficult between second-degree relatives (half-siblings,
grandparent-grandchild, avuncular relationships, etc.).
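
The raw observation behind kinship evaluation, the number of alleles two people share identical by state at each locus, can be tallied with a few lines of code. This is only a sketch of the counting step: it does not compute the probability that shared alleles are identical by descent, and the profiles and function names are hypothetical.

```python
from collections import Counter


def ibs_shared(genotype_a, genotype_b):
    """Number of alleles (0, 1 or 2) identical by state at one locus."""
    return sum((Counter(genotype_a) & Counter(genotype_b)).values())


def ibs_profile(profile_a, profile_b):
    """Tally loci by IBS count over the loci typed in both profiles."""
    tally = Counter()
    for locus in profile_a.keys() & profile_b.keys():
        tally[ibs_shared(profile_a[locus], profile_b[locus])] += 1
    return {k: tally[k] for k in (0, 1, 2)}


# Hypothetical genotypes, for illustration only.
person_a = {"D21S11": (29, 30), "D8S1179": (13, 14), "VWA": (16, 18)}
person_b = {"D21S11": (29, 30), "D8S1179": (14, 14), "VWA": (15, 17)}
print(ibs_profile(person_a, person_b))
```

A locus with zero shared alleles rules out a parent-child relationship in the absence of mutation, but, as noted above, it does not exclude other kinds of kinship.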


Samples used for parentage and kinship analysis 


Until the mid-1990s, blood was the main sample for parentage and kinship investigations.
These days, with the introduction of PCR and the development of capillary electrophoresis
(CE), buccal (mouth) swabs have superseded blood as the sample of choice. According
to the latest available report of the AABB Relationship Testing Standards Program
Unit (2013), buccal swabs constituted more than 99.5% of all samples used for relationship
testing in the USA. Buccal swabs provide good-quality DNA and generally do not
cause complications in the analysis and interpretation of the results. However, in certain
cases, samples from decomposed bodies, histological samples, or pregnancy termination
products have to be used for DNA analysis. These samples are not ideal and pose
considerable technical challenges, which often affect the analysis and interpretation of
the data. DNA from decomposed bodies and histological samples is commonly of inadequate
quality. The fixation process used for the preparation of histological samples and their
subsequent storage often compromises the integrity of DNA. Body decomposition has a
similar effect on DNA quality. Genotyping DNA from pregnancy termination products can
result in a mixed profile containing alleles from both the mother and the fetus, as it is
often impossible to separate fetal and maternal tissue, especially when pregnancy is
terminated at an early stage (Semikhodskii, 2007).


DNA markers used for parentage and kinship analysis 


STRs are currently the genetic markers of choice for parentage and kinship testing. In
cases where an increase in discrimination power (DP) is needed to enhance the resolution
of the analysis, it can be achieved by complementing the original STR marker set with
other markers, e.g., single nucleotide polymorphisms (SNPs) or insertion/deletion
polymorphisms (InDels) (Kayser & de Knijff, 2011; Phillips et al., 2012).


Short tandem repeat markers 

STRs (microsatellites) are DNA sequences made up of a unit that usually varies in length
from 2 to 7 bp and is repeated many times. Although thousands of STRs have been found
in the human genome, only a small subset of them is actually used for DNA identification
and kinship analysis purposes (Butler, 2006). Forensic STR multiplexes commonly
employ tetranucleotide STRs, which improve the efficiency of PCR amplification
and minimize spurious DNA peaks, although pentanucleotide STRs, like Penta E and
Penta D, are also included in some STR marker panels. There are simple and complex
STR loci. In simple STRs, the repeat units are constant throughout the repeat region.
In contrast, complex STRs vary in their motif composition (tri-, tetra-, and
pentanucleotide repeats) and may contain sequence polymorphisms within individual repeat
units (Tilanus, 2006).

STR-based kinship analysis consists of generating PCR amplicons of 100–500 bp using
one of several commercially available STR multiplexes, e.g., PowerPlex Fusion
(Promega, USA) or VeriFiler Plus (Thermo Fisher, USA), followed by separation of
the products of amplification by CE on an Applied Biosystems 3500 Genetic Analyzer
(Thermo Fisher, USA) or a similar high-throughput CE instrument.

Despite being the main markers for human identification and relatedness testing,
STRs suffer from several problems that complicate their use. The relatively large size
of PCR amplicons (100–500 bp) presents an issue when dealing with degraded DNA.
Although this can be mitigated by employing reduced-size amplicon mini-STRs (amplicon
size 60–200 bp), it is still a challenge when fixed histological preparations or samples
from decomposed bodies are presented for analysis. As for noninvasive prenatal paternity
testing (NIPPT), STRs have very limited use due to the nature of the circulating fetal
DNA in the mother’s bloodstream (see below).

Another disadvantage of STR markers is their relatively high mutation rate (Tracey, 2001),
which often causes a mismatch between the AP and the child. Mismatches in parentage
cases due to STR mutations complicate analysis and sometimes lead to false parentage
exclusions. When a mutation event is suspected, typing supplementary STRs or analyzing
samples from other family members may be required to correctly interpret the
genotyping data.

STRs have limited information content per locus (Wenk, 2004); thus, 15–20 markers
or more are needed to achieve convincing levels of the CPI. Because of this, supplementary
markers are often required to increase the DP, especially when a close relative is
presumed to be the alternative parent or when a relationship more distant than the second
degree needs to be investigated (Nothnagel et al., 2010; Phillips et al., 2008). In such
cases, additional STR marker panels need to be employed to find polymorphic markers
to supplement the originally chosen set. This approach is costly, time-consuming, and
not very efficient, as most commercially available multiplexes have many STRs
in common and differ in only a few loci.


Single nucleotide polymorphism markers 

SNPs are the most common type of genetic variation among humans. An SNP represents
a difference in a single nucleotide between alleles. Most SNPs used for human
identification purposes are diallelic, but recently tri-allelic SNPs have been shown to be
forensically promising (Phillips et al., 2020). Compared with STRs, SNPs perform better
when analyzing degraded samples (Kayser & de Knijff, 2011); they have a significantly
higher number of available loci (Børsting & Morling, 2015) and a significantly lower
mutation rate (Schwark et al., 2012). One major disadvantage of SNPs in parentage and
kinship testing is their low discriminative power per locus, which is offset by
increasing the number of typed SNPs.
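
The trade-off between per-locus informativeness and the number of markers can be made concrete with the probability of exclusion (PE) introduced earlier in this chapter. The sketch below uses PE as a stand-in for discriminative power and assumes independent loci, so that the combined PE is one minus the product of the per-locus non-exclusion probabilities. The per-locus values of 0.15 and 0.6 are hypothetical, chosen only to contrast a weakly and a strongly informative marker.

```python
from math import ceil, log, prod


def combined_pe(per_locus_pe):
    """Combined probability of exclusion for independent loci."""
    return 1.0 - prod(1.0 - pe for pe in per_locus_pe)


def loci_needed(per_locus_pe, target):
    """Smallest number of equally informative loci whose combined PE reaches target."""
    return ceil(log(1.0 - target) / log(1.0 - per_locus_pe))


# Hypothetical per-locus PE values, for illustration only.
print(loci_needed(0.15, 0.9999))  # weakly informative, SNP-like marker
print(loci_needed(0.60, 0.9999))  # strongly informative, STR-like marker
```

Under these assumed values, roughly five times as many weakly informative markers are needed to reach the same combined PE, which is why SNP panels are typically much larger than STR panels.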


Insertion-deletion polymorphism markers 

InDels are DNA length variation markers characterized by the presence or absence (insertion
or deletion) of a certain DNA sequence (typically from 1 to 50 bp). They are diallelic,
with great length variation between alleles, have a lower mutation rate in
comparison to STRs, and are widely distributed throughout the genome, constituting
about 20% of human genetic variation (Santos et al., 2010). Like SNPs, InDels have a
low DP (Cortellini et al., 2020) and are mostly used as supplementary markers to STRs.


Microhaplotype markers

Microhaplotypes (MHs) are a novel type of molecular marker with a broad range of uses
in forensics, parentage and kinship testing, missing person identification, and medical
genetics. An MH locus is a short (less than 300 bp) segment of DNA characterized by the
presence of two or more closely linked SNPs that have at least three allelic combinations
(i.e., “haplotypes”) (Oldoni et al., 2019). Because of the short distance between the SNPs,
each MH is thought to be inherited as a single block that is passed on from generation
to generation (Kidd et al., 2014).

Being multi-SNP markers, MHs provide, on a per-locus basis, a higher DP than individual
SNPs. A big advantage of MHs over STRs is that the absence of repetitive regions
removes the problem of stutters caused by DNA polymerase slippage during amplification
(Staadig & Tillmar, 2021).

Due to the low mutation rate of SNPs, MHs can complement standard STRs in
parentage and kinship tests, especially in deficient family pedigrees (Oldoni et al.,
2019), or be used on their own (Staadig & Tillmar, 2021).
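
Why a multi-SNP locus is more informative than a single SNP can be sketched by treating each observed SNP combination as one allele and summarizing the locus by its effective number of alleles, 1/Σp². This metric is not defined in this chapter and is introduced here as an assumption; the haplotype sample and function names are likewise hypothetical.

```python
from collections import Counter


def haplotype_frequencies(haplotypes):
    """Relative frequency of each observed haplotype (SNP combination)."""
    counts = Counter(haplotypes)
    total = sum(counts.values())
    return {h: n / total for h, n in counts.items()}


def effective_alleles(freqs):
    """Effective number of alleles, 1 / sum(p_i^2)."""
    return 1.0 / sum(p * p for p in freqs.values())


# Hypothetical phased haplotypes over three linked SNPs, for illustration only.
sample = ["ACG"] * 4 + ["GCG"] * 4 + ["GTA"] * 2
mh_ae = effective_alleles(haplotype_frequencies(sample))
snp_ae = effective_alleles({"A": 0.5, "G": 0.5})  # best case for a diallelic SNP
print(round(mh_ae, 2), snp_ae)
```

A diallelic SNP can never exceed an effective allele number of 2, whereas even this small three-haplotype example rises above it.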


Genotyping technology 

Capillary electrophoresis 

CE is currently a standard method for STR and InDel analysis and can also be used for 
SNP genotyping. STR and InDel alleles are identified by the length of a PCR-amplified 
fragment (length-based polymorphism) and are called relative to known-sized allelic lad- 
ders, with a number value indicating the number of complete repeats observed. 

Many STRs exhibit variation within the repeat region, having alleles equal in length 
but different in sequence. Such alleles are called isoalleles (Gettings et al., 2015). Some 
STR loci (e.g., D21S11, D8S1179, VWA, and several others) contain almost twice as
many alleles differing in sequence as those differing in length (Gettings et al., 2015).
A major disadvantage of the CE-based genotyping approach is that this technology 
does not allow the detection of any sequence variation between STR alleles, which 
may assist in making a decision as to the parental origin of the allele or even indicate a 
genetic mismatch. The resulting loss of information decreases the DP of the STR gen- 
otyping panel used for testing. 
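
The information lost by length-based calling can be shown with two invented repeat regions of equal length. The sketch below assumes a fixed 4-bp repeat unit and a region that contains only complete units; the sequences and function names are hypothetical and are not taken from any published allele.

```python
from itertools import groupby


def length_call(repeat_region, unit_len=4):
    """CE-style call: number of complete repeat units inferred from length alone."""
    return len(repeat_region) // unit_len


def bracketed(repeat_region, unit_len=4):
    """Sequence-based description: consecutive identical units are collapsed."""
    units = [repeat_region[i:i + unit_len] for i in range(0, len(repeat_region), unit_len)]
    return " ".join(f"[{unit}]{len(list(group))}" for unit, group in groupby(units))


# Two invented isoalleles: same length, different internal sequence.
allele_1 = "TCTA" * 4 + "TCTG" * 6
allele_2 = "TCTA" * 6 + "TCTG" * 4

print(length_call(allele_1), length_call(allele_2))
print(bracketed(allele_1))
print(bracketed(allele_2))
```

Both alleles receive the same length-based call, while the sequence-based description separates them.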

Another significant drawback of the CE genotyping technology is its limited multiplex
capability due to the small number of dye channels available for simultaneous analysis
(Grandell et al., 2016). Because amplicons of similar sizes have to be labeled with
different fluorophores, the number of loci that can share a given size range cannot
exceed the number of dye channels in a genetic analyzer, which is 6 for the ABI 3500
family of instruments (Thermo Fisher, USA) or 6–8 for the Spectrum family of instruments
(Promega, USA).

The introduction of next-generation sequencing (NGS) within the framework of
parentage and kinship testing provides solutions for the above limitations.


Next-generation sequencing

Second-generation sequencing, or NGS, as it is commonly known, was developed in
the 1990s and became commercially available in 2005. The term “next generation” implies
the next step in the development of DNA sequencing technology from the first
generation (Sanger sequencing). Second-generation sequencing methods can be grouped
into two major categories: sequencing by hybridization, using DNA microarrays, and
sequencing by synthesis (SBS), which includes pyrosequencing, sequencing by ligation,
ion semiconductor- and nanopore-based technologies (Ion Torrent, Thermo Fisher, USA;
Oxford Nanopore, Oxford, UK; SBX, Stratos Genomics, Seattle, USA), and bridge
amplification technology developed by Illumina (USA) (Slatko et al., 2018). All these
technologies share satisfactory accuracy, high throughput, and a lower cost per base
compared to Sanger sequencing (Liu et al., 2012). Sequencing by hybridization is mainly
used in medical genetics for diagnostic applications such as identifying disease-related
SNPs or for molecular karyotyping (Slatko et al., 2018). NGS based on SBS has both medical
and forensic applications. This is the type of NGS used for paternity and kinship testing.
Bruijns et al. (2018) provide an excellent review of the molecular chemistry and
technologies of various NGS platforms.

All the SBS technologies rely on high sequence coverage achieved by massively parallel
sequencing (MPS), where millions of reads from DNA fragments produced at the same
time are detected on the same chip or flow cell in parallel (hence the name) (Saiz et al.,
2020). Such a high number of reads is necessary to compensate for the
elevated error rate of NGS relative to Sanger sequencing. After MPS, an accurate
sequence of a DNA fragment is obtained as a consensus sequence of all the reads
of the same fragment (Slatko et al., 2018).
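
How many reads compensate for individual read errors can be seen in a toy majority-vote consensus. This sketch assumes that the reads are already aligned, of equal length, and free of insertions and deletions; real pipelines also weigh base qualities, which are ignored here. The reads are invented.

```python
from collections import Counter


def consensus(reads):
    """Majority-vote consensus of equal-length, pre-aligned reads."""
    if len({len(read) for read in reads}) != 1:
        raise ValueError("reads must be non-empty and of equal length")
    # zip(*reads) walks the alignment column by column.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


# Invented reads of one fragment; two of them carry a single sequencing error.
reads = ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTAT", "ACGTAC"]
print(consensus(reads))
```

Each erroneous base is outvoted by the correct base in the other reads covering the same position, so the consensus is error-free even though two of the five reads are not.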

The MPS approach has several principal advantages over CE, with important impli- 
cations for parentage and kinship analysis. One of the crucial features of MPS is the high 
level of multiplexing, both in the number of molecular markers and in the number of 
samples. Because MPS does not require size separation between amplicons, it overcomes
the multiplexing issue of current CE methods and allows typing a significantly larger
number of DNA markers (not only STRs but also SNPs, InDels, MHs, and mtDNA)
simultaneously in a single assay (Daniel et al., 2015). In addition, many samples can be
multiplexed in a single sequencing run by using barcoding, which further increases the
throughput (Grandell et al., 2016). Also, because NGS requires much smaller amplicons
than CE-based STR analysis, it allows genetic information to be obtained from badly
degraded genomic DNA.

In length-based CE genotyping, only the size of the allele is taken into consideration.
In contrast, MPS can not only detect fragment length polymorphisms but also obtain
detailed sequence information about each allele analyzed (Gettings et al., 2016). This
allows the discovery of previously unknown alleles and mutational events and the
investigation of isoalleles and their parental origin (Rockenbauer et al., 2014), thus
helping to resolve many issues that affect CE-based parentage and kinship analysis. An
additional bonus of having sequence information is that SNPs and InDels are often found
in the repeat units of STR loci or in the flanking regions (Rockenbauer et al., 2014),
further increasing the DP of sequence-based testing compared to conventional CE
genotyping. In some cases, having sequence information may even permit differentiating
monozygotic twins (Weber-Lehmann et al., 2014). Because sequence information allows the
differentiation of isoalleles, their origin can be determined more precisely. As a result,
the correct formulas for calculating relatedness statistics can be employed, increasing
the accuracy of the computation.

In parentage testing, a major advantage of MPS over CE is that it permits a much 
deeper investigation of the nature of genetic inconsistencies. When an inconsistency is 
observed, CE can only determine the size of the alleles. NGS analysis goes further by 
obtaining information about every nucleotide in the PCR amplicon. In addition to 
length, it makes it possible to investigate the structure of the motif composition, sequence 
polymorphism within individual repeat units, and the flanking regions of the alleles. For 
example, a mutation or another event in the core repeat could change the size of the 
allele, or insertions/deletions in the flanking regions could affect primer binding (null al- 
leles) and also the size of the allele (Silva et al., 2018). 

Crucially, genotyping data obtained by MPS can be compared directly with those
obtained by CE-based analysis. Current CE-based allelic nomenclature conventions for
STR and other length-based markers can be used to assign alleles obtained by MPS
typing (Devesse et al., 2018), and the current length-based allelic frequency databases
can be used for calculating parentage and kinship statistics in MPS-based tests.
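
As a minimal sketch of how a sequenced allele can be reported under the length-based convention, the Python snippet below converts the length of a repeat region into a CE-style designation: the number of complete repeat units, followed after a decimal point by the number of bases in any partial unit. The function name and the example sequences are illustrative assumptions and do not come from any particular analysis software; real tools also have to apply locus-specific rules and account for InDels in the flanking regions.

```python
def ce_allele_name(repeat_region: str, motif_len: int = 4) -> str:
    """Return a length-based allele designation for a sequenced repeat region.

    Illustrative helper: complete repeat units form the integer part, and the
    bases of an incomplete unit follow the decimal point (for example "9.3").
    """
    full_units, partial_bases = divmod(len(repeat_region), motif_len)
    return f"{full_units}.{partial_bases}" if partial_bases else str(full_units)


# Hypothetical tetranucleotide alleles of identical length but different sequence.
allele_a = "TCTA" * 10
allele_b = "TCTA" * 4 + "TCTG" + "TCTA" * 5
# A 39-base repeat region: nine complete units plus a 3-base partial unit.
allele_c = "AATG" * 6 + "ATG" + "AATG" * 3

for name, seq in [("a", allele_a), ("b", allele_b), ("c", allele_c)]:
    print(name, ce_allele_name(seq))
```

Alleles a and b receive the same length-based name although their sequences differ, which is why the same frequency database can be used for both technologies while the sequence itself carries additional information.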

One major issue that prevents the wider adoption of NGS for routine parentage and 
kinship testing is the much higher cost per sample compared to the standard CE analysis 
(Alonso et al., 2018). That said, the increased cost of NGS analysis can be offset by
its ability to deal better than CE with degraded samples, by high multiplexing, and by
the sequence information it provides. Further development of NGS will undoubtedly
reduce the cost, making it commercially competitive with the current CE-based
approach.


Commercially available NGS kits for parentage and kinship testing 

A number of NGS platforms and kits for simultaneous analysis of forensically important 
DNA markers have been developed during the past 15 years (Saiz et al., 2020; Li et al., 
2021). Currently, the Illumina MiSeq/Verogen MiSeq FGx (Illumina/Verogen, USA) 
and Ion GeneStudio S5 systems (ThermoFisher, USA) are the most popular instruments 
for forensic identity and relatedness testing. Recently, MGI developed a novel NGS 
sequencer, MGISEQ-2000 (MGI Tech, Co., Ltd., China), which promises to be an 
interesting alternative to these two systems. Available NGS platforms for forensic iden- 
tification and parentage and kinship applications are discussed in detail elsewhere in 
this handbook.

The currently available commercial NGS multiplexes for forensic and relatedness
purposes and the platforms for their analysis are presented in Table 12.1. The kits listed
in the table can be loosely subdivided into three categories: (1) kits for analysis of
“standard” forensic STR markers (both autosomal and gonosomal); (2) kits for analysis of
“standard” forensic STR markers supplemented with novel or less used STRs, SNPs, and
sometimes mtDNA; and (3) kits only for analysis of SNP markers and mtDNA
sequencing. Multiplexes of the first and second groups are generally used for parentage
and kinship analysis, whereas those from the last group are mainly used as an auxiliary
tool when more markers are needed to form an opinion as to the relationship in question.

The STR-containing multiplexes include loci that are part of the Combined DNA
Index System (CODIS) and the European Standard Set, supplemented by Y-STRs,
X-STRs, and amelogenin (Amel). Supplementary SNPs usually include
identity- and ancestry-informative markers and even phenotype-informative loci. Some of the
SNP-only systems have been developed to assist in specific kinship investigations. For
example, the ForenSeq Kintelligence Kit targets 10,230 SNP markers and is aimed at
forensic genetic genealogy research, whereas the MGIEasy Pa-SNPs Genotyping Kit is
specifically designed for use in NIPPT, paternity testing, and individual identification.

Most commercially available multiplexes have been designed to be used only with a
particular MPS platform. PowerSeq 46GY and the ForenSeq DNA Signature Kit are
compatible only with MiSeq systems (Illumina/Verogen, USA), whereas the Precision ID
GlobalFiler NGS STR Panel v2 and the MGIEasy Signature Identification Library Prep
Kit have been designed exclusively for use on Ion S5 systems (ThermoFisher Scientific,
USA) and the MGISEQ-2000 (MGI Tech. Co., Ltd.), respectively. A big advantage of
libraries prepared with the QIAseq family of multiplexes (Qiagen, USA) is that they can
be sequenced on both MiSeq and S5 systems.


Parentage testing 


Depending on when the child’s sample is collected, testing for parentage can be either 
postnatal, when the sample is collected after the birth of the child, or prenatal when 
the sample is collected while the child is still in utero. The latter can be further subdivided 
into invasive and noninvasive depending on the type of child’s biological sample used for 
analysis. STR markers are generally used for postnatal and invasive prenatal tests. Because
of the nature of the child’s DNA available for analysis, SNPs are currently the markers of
choice for noninvasive prenatal testing.


Postnatal parentage testing 


Table 12.1. Kits grouped by manufacturer, with the instruments listed for each group (the marker-type columns, beginning with autosomal STR, are not reproduced).

Promega, USA. Instruments: MiSeq.
    PowerSeq 46GY
    PowerSeq CRM Nested System

Verogen, USA. Instruments: MiSeq FGx.
    ForenSeq DNA Signature Kit
    ForenSeq MainstAY Kit
    ForenSeq mtDNA Whole Genome Kit
    ForenSeq mtDNA Control Region Kit
    ForenSeq Kintelligence Kit

ThermoFisher, USA. Instruments: HID Ion S5, HID Ion GeneStudio S5 system.
    Precision ID GlobalFiler NGS STR Panel v2

MGI Tech Co., China. Instruments: MGISEQ-2000, DNBSEQ-G400RS, DNBSEQ-G50, DNBSEQ-G400.
    MGIEasy Signature Identification Library Prep Kit
    MGIEasy mtDNA Whole Genome Amplification Kit
    MGIEasy Pa-SNPs Genotyping Kit

Qiagen, USA. Instruments: MiSeq, Ion S5 system.
    QIAseq Investigator Missing Persons SNP Panel
    QIAseq Investigator ID SNP Panel
    QIAseq Human Mitochondria Panel
    QIAseq Investigator Microhaplotype Panel
    QIAseq Investigator Human Mitochondria Control Region Panel

Notes to Table 12.1:
    The loci included are Combined DNA Index System (CODIS) core and European Standard Set (ESS) loci, amelogenin, and the Y-STR loci included in the PowerPlex Y23 System.
    mtDNA control region.
    94 identity-informative SNPs, 56 ancestry-informative SNPs, and 22 phenotypic-informative SNPs.
    Full (16,569 bp) mitochondrial genome.
    Includes 21 CODIS STR markers, 9 additional multiallelic STR markers, and 4 sex determination markers.
    145 identity-informative SNPs, 53 ancestry-informative SNPs, and 29 phenotypic-informative SNPs.
    Seven amplicons covering the whole hypervariable region of mtDNA.
    The kit is specifically designed for use in noninvasive prenatal paternity testing, paternity testing, and individual identification.
    The loci include 1200 tri-allelic kinship SNPs, 34 X-chromosome SNPs, and 55 ancestry-informative SNPs. Furthermore, the panel includes a set of 46 microhaplotypes.
    The panel is specifically designed to provide additional discrimination power to parentage and kinship tests.

The main issue in parentage testing is to differentiate whether an observed mismatch at
the obligate allele is due to parentage exclusion or caused by mutation. Because of
elevated mutation rates of STR loci (Tracey, 2001) mismatches of one or two alleles
are not uncommon in parentage testing, and nonexclusion cases with three mutations
have also been described (Jacewicz et al., 2004). Information obtained by CE is often
not enough to decide whether or not the genetic inconsistency is caused by mutation,
and clarifying the issue requires analysis of more markers and/or additional family
members. The introduction of MPS into the framework of parentage testing makes
it possible to compensate for this and other problems associated with CE-based STR
typing. The sequence information obtained by MPS makes it possible to investigate the
true variation within STR loci, to resolve whether or not an observed length-based
inconsistency is due to mutation, and to determine the specific paternal or maternal
mode of inheritance of the allele in question.

Li and coworkers (Li et al., 2017) evaluated the performance of the Precision ID GlobalFiler
NGS panel (Thermo Fisher, USA), containing 30 informative STRs and Amel, for
paternity testing of six paternity duos and one paternity trio, all with mismatched
STR loci. In five cases, sequencing data from STR motifs were enough to determine that the
observed genetic inconsistency was due to a mutation inherited from the father, while in
the other two cases, SNP data from the flanking regions had to be used in addition to
determine which allele of the AF was the source of the mutation. Compared to CE
genotyping, the total number of alleles observed by MPS increased significantly (from 176
to 193), as sequencing information allowed isoalleles to be distinguished. These extra alleles
were observed in 9 out of the 30 informative STRs. Other studies likewise report an increase
in the number of STR alleles observed by MPS typing compared to length-based analysis (Ma
et al., 2016; Li et al., 2019; Silva et al., 2018). The expanded range of STR loci and the large
number of SNP markers included in commercially available or custom-made MPS panels
make it possible to resolve parentage without the need for repeat analysis using additional
marker panels or extra family members (Ma et al., 2016).

The usefulness of MPS for tracing the origin of parental mutations was also shown in 
another MPS-based parentage study using the ForenSeq DNA Signature Prep Kit (Illumina,
USA). Three mutations were observed in 81 parent-offspring pairs for autosomal
STRs, and two mutations were observed in 26 father-son pairs for Y-STRs. By analyzing 
sequence information, the researchers could determine the sequence of each of these alleles 
and how it changed during DNA transmission for parent-offspring pairs (Li et al., 2019). 

Isoalleles represent a particular issue in CE-STR parentage analysis as they exhibit no 
length variation between the parental and filial alleles. An isoallelic mismatch will be 
missed by length-based typing, causing the wrong results to be reported. Only
MPS can detect the sequence variation and confirm the genetic inconsistency (Li
et al., 2017; Ma et al., 2016).
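
The effect of an isoallelic mismatch can be illustrated with a small Python sketch. It compares a hypothetical alleged father and child at one STR locus twice: once using only allele length, as CE does, and once using the full sequence, as MPS does. The sequences and the helper name are invented for illustration, and mutation models and flanking-region variation are deliberately ignored.

```python
def shares_allele(parent, child, key):
    """Return True if parent and child have at least one allele in common
    once every allele is reduced with `key` (length or full sequence)."""
    return bool({key(a) for a in parent} & {key(a) for a in child})


# Hypothetical repeat-region sequences for one STR locus.
alleged_father = ("TCTA" * 10, "TCTA" * 12)
child = ("TCTA" * 4 + "TCTG" + "TCTA" * 5, "TCTA" * 13)

# CE sees only fragment length; MPS sees the sequence itself.
consistent_by_length = shares_allele(alleged_father, child, key=len)
consistent_by_sequence = shares_allele(alleged_father, child, key=str)

print("length-based typing consistent:", consistent_by_length)
print("sequence-based typing consistent:", consistent_by_sequence)
```

In this constructed case the 40-base alleles of father and child match by length, so the locus looks consistent in CE, whereas the sequence comparison reveals that no allele is actually shared.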

The CPI values obtained by MPS genotyping are generally higher than those from
length-based analysis. In a study using the ForenSeq DNA Signature Prep Kit (Illumina,
USA), the average parentage index (PI) obtained by sequence-based analysis of the STR
loci included in the panel (log10 PI = 11.52) was significantly higher than that observed
by length-based testing of the same loci (log10 PI = 9.20). When the STR
and SNP loci included in this panel were combined, the average PI was several orders
of magnitude higher than those observed for sequence-based or length-based STR
genotyping alone (log10 PI = 17.39 vs. log10 PI = 11.52 and 8.20, respectively)
(Li et al., 2019).
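
To make the link between allele frequency and PI concrete, the following Python sketch uses the standard single-locus trio formulas (no mutation, alleged father carrying the obligate paternal allele) and sums log10 values to obtain the combined index. The allele frequencies are hypothetical and chosen only to show why resolving a common length-based allele into rarer isoalleles raises the PI; they are not taken from the studies cited above.

```python
import math


def trio_pi(obligate_allele_freq: float, af_homozygous: bool = False) -> float:
    """Paternity index for one locus in a trio without mutation.

    Assumes the alleged father (AF) carries the obligate paternal allele:
    PI = 1/p if the AF is homozygous for it, 1/(2p) if heterozygous.
    """
    transmission = 1.0 if af_homozygous else 0.5
    return transmission / obligate_allele_freq


def log10_combined_pi(per_locus_pi) -> float:
    """log10 of the combined PI, the product of independent per-locus PIs."""
    return sum(math.log10(pi) for pi in per_locus_pi)


# Hypothetical frequencies: a length-based allele (0.20) that sequencing
# resolves into isoalleles, the transmitted one having frequency 0.05.
pi_length = trio_pi(0.20)
pi_sequence = trio_pi(0.05)
print("PI, length-based:", pi_length)
print("PI, sequence-based:", pi_sequence)
print("log10 CPI over three loci:", log10_combined_pi([pi_length, 10.0, 4.0]))
```

Working on the log10 scale, as in the values reported above, avoids very large products when many loci are combined.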

Overall, MPS offers several key advantages over CE-STR analysis in resolving
parentage cases. By capturing sequence information, MPS allows the identification of a
higher number of alleles than the CE-based genotyping approach.
Sequence information produced by this technology permits not only differentiation
between a mutation and an exclusion but, crucially, determination of the parental origin
of the allele in question. All this, together with the ability to test large numbers of
various types of DNA markers, makes MPS typing more powerful for resolving parentage
disputes than the current CE-based methods.


The use of MPS in detecting uniparental disomy in parentage testing 


MPS is an indispensable tool when parentage testing is complicated by the presence of a 
genetic anomaly in the child. One such anomaly that can directly affect the results of 
parentage testing is uniparental disomy (UPD). UPD is a rare and fascinating phenomenon
in which both chromosomes of a homologous pair are derived exclusively
from one parent, either the father or the mother (Engel, 1980).
Most UPDs appear to be of no phenotypic consequence to the carrier (Kotzot,
2002), although some have been found to underlie several important medical conditions
and syndromes, such as Prader-Willi syndrome, Angelman syndrome, and transient
neonatal diabetes, among others. In some cases, UPDs can be the cause of homozygosity
for recessive genes, leading to autosomal-recessive disorders. There are many types of
UPDs: they can be constitutional or acquired, affect the whole chromosome (complete)
or only a particular section (segmental), the chromosomal pair can be of paternal or
maternal origin, and chromosomes derived from one parent may be identical (isodisomy)
or different (heterodisomy) (Liehr, 2014).

Apart from their significance in medical genetics, UPDs often pose a problem in
parentage testing, making it difficult to distinguish between a true exclusion and a false
exclusion due to abnormal chromosomal inheritance (Mansuet-Lupo et al., 2009).
The presence of UPD, especially in asymptomatic individuals, can be the reason for
pseudoexclusion of parentage, particularly in the presence of another, unrelated mismatch
(Kerr et al., 2018). The problem is particularly acute in deficiency cases, when a
DNA sample of the undisputed parent is not available for analysis. Pseudoexclusions
of maternity (Wegener et al., 2006) or paternity (Bein et al., 1998) due to the presence
of UPD have been documented. The presence of a segmental UPD additionally
complicates establishing true parentage, due to the discontinuous nature of this genetic
abnormality.

When discordant results are observed and a UPD is suspected, STR length-based
profiling alone is often not sufficient to confirm the presence of this chromosomal
anomaly in the child. The power of MPS solves the issue in most cases. The sequence
information of the alleles potentially affected by UPD can point to their parent of origin.
The ability to type additional markers on the chromosome, such as SNPs, can not only
provide information as to the parental origin of the chromosome (Chen et al., 2019; Su et al.,
2020; Zhang et al., 2019) but also reveal whether the UPD affects the whole chromosome
or only a particular fragment (Su et al., 2020).
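
One simple way to decide when such follow-up typing is worthwhile is to check whether the observed inconsistencies cluster on a single chromosome. The Python sketch below is only a screening heuristic under stated assumptions: the function name and the threshold of two inconsistencies are hypothetical, and a flagged chromosome is a prompt for additional SNP typing, not evidence of UPD in itself. The chromosomal locations used for the example loci are the standard ones.

```python
from collections import Counter


def chromosomes_to_follow_up(inconsistent_loci, locus_to_chromosome, min_cluster=2):
    """List chromosomes carrying at least `min_cluster` parent-child inconsistencies.

    Several mismatches on one chromosome, with the rest of the profile
    consistent, are more suggestive of UPD than mismatches scattered
    across the genome. The threshold is an illustrative assumption.
    """
    counts = Counter(locus_to_chromosome[locus] for locus in inconsistent_loci)
    return sorted(chrom for chrom, n in counts.items() if n >= min_cluster)


# Chromosomal locations of a few commonly used STR loci.
locus_to_chromosome = {
    "D21S11": "21",
    "Penta D": "21",
    "TH01": "11",
    "vWA": "12",
}

print(chromosomes_to_follow_up(["D21S11", "Penta D"], locus_to_chromosome))
print(chromosomes_to_follow_up(["D21S11", "TH01"], locus_to_chromosome))
```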

This unusual type of chromosomal inheritance should always be considered when
discordant parentage results are obtained. Regrettably, it is uncommon for parentage
testing laboratories to perform additional testing, especially when the number of genetic
inconsistencies observed is 1–3. Because paternity testing laboratories receive
information about a child’s medical history only in particular cases, and because this genetic
abnormality is rare and predominantly asymptomatic, UPD can often be missed, and alleles
affected by it are mistakenly treated as true genetic inconsistencies, causing pseudoexclusion of
parentage. The problem of UPD identification is exacerbated by the relatively small
number of genetic markers included in commercially available STR multiplexes and
the inadequacy of the current CE genotyping technology for detecting UPDs. The
power of MPS applied in parentage analysis provides the tools to further investigate the
initial allelic mismatches, greatly facilitating the task of UPD identification and
minimizing false parentage exclusions.


Noninvasive prenatal paternity testing 


In some situations, it is necessary to perform a paternity test before the child is born. If the
pregnancy is suspected to have resulted from a sexual assault or a proscribed relationship,
confirmation of the identity of the biological father allows the pregnant woman to make
an informed decision as to the termination or continuation of the pregnancy. Until recently,
paternity testing of an unborn child relied on invasive procedures such as amniocentesis,
performed at gestational week (GW) 15–24, or chorionic villus sampling, performed at
GW 8–12. These procedures are stressful, are associated with various pregnancy-related
complications, pose a serious risk of intrauterine infection and damage to the fetus,
and carry a risk of miscarriage of about 1% (Akolekar et al., 2015). In the early
2010s, a new, noninvasive methodology of paternity testing was developed. This
approach, called NIPPT, is based on the analysis of circulating cell-free fetal DNA
(ccffDNA) in maternal blood.

The ccffDNA was discovered in 1997, when Dennis Lo and colleagues (Lo et al., 1997)
reported the presence of cell-free fetal DNA in the blood of women during pregnancy. This
type of DNA mainly originates from trophoblast cells undergoing apoptosis (Jackson, 2003).
Other sources of ccffDNA are the transfer of fetal DNA through the placenta and constant
placental expansion during gestation (Sekizawa et al., 2003). The relative amounts
of ccffDNA vary across pregnancies and gestational ages, reaching as high as 40%
and as low as 1% (Muzzey, 2018), but generally, they range between 10% and
20% toward the latter stages of pregnancy (Lun et al., 2008). The ccffDNA first becomes
detectable at GW 4–5 (Illanes et al., 2007) and disappears within 2 days postpartum (Lo
et al., 1999). By GW 9–10, the amount of ccffDNA exceeds 3%–4% (Christiansen
et al., 2019), which is sufficient for most types of noninvasive analyses, such as testing for
aneuploidies (Renga, 2018), monogenic disorders (Chang et al., 2021; Chen et al.,
2021), blood group genotyping (Alshehri & Jackson, 2021), paternity testing (Chang
et al., 2019), and so on.

The ccffDNA consists of short fragments, usually less than 160 bp in length, approximately
corresponding to the length of the DNA wound around a nucleosomal core unit
(Luger et al., 1997). Placenta-derived cell-free DNA fragments are shorter
(143 bp) than maternally derived DNA fragments (166 bp) (Renga, 2018). Although
circulating cell-free DNA in the maternal bloodstream is highly fragmented, the entire
fetal genome can still be determined from maternal plasma (Kitzman et al., 2012).

The small size of ccffDNA fragments has implications for the choice of genetic
markers for NIPPT. STRs are currently the standard marker system for postnatal and
invasive prenatal parentage testing; however, because of the short length of ccffDNA
fragments, STR alleles longer than 160 bp will not be amplified. In the early attempts
to perform NIPPT using the AmpFLSTR Identifiler Kit (Thermo Fisher, USA), only
a few informative paternally derived STR alleles could be detected (Wagner
et al., 2009). Results obtained with the ForenSeq DNA Signature Prep Kit (Illumina,
USA) were also unsatisfactory (Shen et al., 2021). In addition, the amplification of
STRs often generates PCR artifacts that may be mistaken for paternally inherited fetal
alleles (Christiansen et al., 2019; Shen et al., 2021). The heavy dilution of ccffDNA
with maternal cell-free DNA presents further technical challenges. In this situation,
genetic markers based on small amplicons, like SNPs (Chang et al., 2019; Zhang et al.,
2018) and MHs (Bai et al., 2020; Ou & Qu, 2020; Wang et al., 2020), are much
more informative than STRs when analyzing highly fragmented mixed DNA. Analysis
of a large number of these markers by MPS allows noninvasive paternity testing of an
unborn child to be performed reliably and robustly.

The main challenge in developing an SNP-based NIPPT is the choice and accurate
genotyping of fetal SNPs in the presence of a large amount of maternal DNA. Although
millions of SNPs have been identified in the human genome, SNPs selected for NIPPT
must be high-frequency (Chang et al., 2019; Jiang et al., 2016) and must not be associated
with medical conditions. To be able to detect fetally derived SNPs, only markers for which
the mother is homozygous and the fetus is heterozygous can be used. These are called
“effective SNPs”.
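
The selection rule for effective SNPs can be sketched in Python as follows. The sketch assumes that the maternal genotype is known from maternal genomic DNA and that the fetal genotype is inferred from plasma read counts: a locus is kept when the mother is homozygous and a non-maternal allele is seen above a noise threshold. The SNP identifiers, the read counts, and the 1% threshold are hypothetical; real pipelines model sequencing error and fetal fraction explicitly.

```python
def effective_snps(maternal_genotypes, plasma_counts, min_fraction=0.01):
    """Return {snp: (non_maternal_allele, read_fraction)} for effective SNPs.

    maternal_genotypes: {snp: two-letter genotype, e.g. "AA" or "AG"}
    plasma_counts:      {snp: {allele: read count in maternal plasma}}
    min_fraction:       illustrative noise threshold (an assumption)
    """
    effective = {}
    for snp, genotype in maternal_genotypes.items():
        if genotype[0] != genotype[1]:
            continue  # mother heterozygous: fetal allele cannot be told apart
        maternal_allele = genotype[0]
        counts = plasma_counts[snp]
        total = sum(counts.values())
        foreign = {a: n for a, n in counts.items() if a != maternal_allele}
        if total == 0 or not foreign:
            continue
        allele, reads = max(foreign.items(), key=lambda item: item[1])
        if reads / total >= min_fraction:
            effective[snp] = (allele, reads / total)
    return effective


maternal_genotypes = {"snp1": "AA", "snp2": "AG", "snp3": "CC"}
plasma_counts = {
    "snp1": {"A": 950, "G": 50},   # non-maternal G well above noise
    "snp2": {"A": 520, "G": 480},  # mother heterozygous
    "snp3": {"C": 999, "T": 1},    # single T read treated as noise
}
print(effective_snps(maternal_genotypes, plasma_counts))
```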

Information on the amount of fetal DNA in maternal plasma is required for correctly
identifying and calling paternally inherited alleles. As only effective SNPs are used for
calculating the weight of evidence, about half of the SNP markers analyzed will be
uninformative in a particular study (Jiang et al., 2016; Yang et al., 2017). This has
implications for the number of SNPs that need to be included in the panel to achieve the
desired degree of confidence.
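
A common way to obtain the fetal fraction from the effective SNPs themselves rests on a simple relation: where the mother is homozygous and the fetus heterozygous, only half of the fetal fragments carry the paternal allele, so the paternal allele fraction in plasma is about half the fetal fraction. The Python sketch below applies this relation and takes the median across loci for robustness. The function name and the example values are assumptions for illustration, and corrections for sequencing error or amplification bias are omitted.

```python
from statistics import median


def estimate_fetal_fraction(paternal_allele_fractions):
    """Estimate the fetal fraction from effective SNPs.

    Each input value is the share of plasma reads carrying the paternally
    inherited (non-maternal) allele at one effective SNP. The expected share
    is f/2, so the fetal fraction f is estimated as twice the median.
    """
    return 2 * median(paternal_allele_fractions)


# Hypothetical paternal allele fractions observed at four effective SNPs.
fractions = [0.04, 0.05, 0.06, 0.05]
print(f"estimated fetal fraction: {estimate_fetal_fraction(fractions):.2f}")
```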

The GW of blood collection from the pregnant mother is also crucial for a reliable test 
result. Christiansen et al. (2019) investigated the chances of identifying paternally 
inherited alleles in maternal blood using the Precision ID Identity Panel (Thermo Fisher, 
USA). They showed that at GW 7, only a moderate number of paternally inherited
fetal SNP alleles could be detected; at GW 12, paternally inherited fetal SNP alleles were
observed in the great majority of the pregnant women; and at GW 20, paternally
inherited fetal SNP alleles were found in all pregnant women.

Various studies have demonstrated the feasibility of detecting fetal-specific alleles in 
the blood of pregnant women and determining paternity. Jiang et al. (2016) developed 
an NIPPT based on MPS analysis of maternal plasma DNA using a custom panel con- 
taining 5x10°—8x10° high-frequency SNPs. Sequencing-based NIPPT results fully 
agreed with those obtained by invasive prenatal paternity testing using the AmpFLSTR 
Identifiler (Thermo Fisher, USA). In 17 out of 17 cases, paternity was determined 
correctly (Jiang et al., 2016). 

Likewise, Chang et al. (2019) created a custom panel containing 5457 SNPs to evaluate
349 paternity cases with pregnancies of GW 6–35 and 9 negative controls from
nonpregnant women who had previously been pregnant. Each mother-fetus pair
was compared with all males in a blind manner. The biological fathers in all cases
were correctly identified among the unrelated males. In negative controls, fetal SNPs
originating from previous pregnancies could not be detected. A randomly selected subset
of 50 SNPs from the total number of SNPs in the panel yielded a combined PE of 0.9999
in full trio paternity testing (Chang et al., 2019).
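
The order of magnitude of such a combined PE can be checked with a short calculation. For a biallelic marker with allele frequencies p and q = 1 - p, the standard exclusion probability in a full trio is pq(1 - pq), and independent markers combine as one minus the product of the per-marker non-exclusion probabilities. The Python sketch below assumes, purely for illustration, 50 independent SNPs with an ideal allele frequency of 0.5; it is not a reconstruction of the calculation in the cited study.

```python
import math


def trio_pe_biallelic(p: float) -> float:
    """Probability of exclusion for a biallelic marker in a full trio."""
    q = 1.0 - p
    return p * q * (1.0 - p * q)


def combined_pe(per_marker_pe) -> float:
    """Combined PE over independent markers: 1 - prod(1 - PE_i)."""
    return 1.0 - math.prod(1.0 - pe for pe in per_marker_pe)


single = trio_pe_biallelic(0.5)   # best case for a biallelic SNP
combined = combined_pe([single] * 50)
print(f"PE per SNP: {single:.4f}")
print(f"combined PE over 50 SNPs: {combined:.6f}")
```

Under these idealized assumptions, 50 SNPs already exceed a combined PE of 0.9999, which is consistent with the figure reported above.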

In a study aimed at developing a method for NIPPT, Tam et al. (2020) employed
MPS for sequencing a panel of 356 target SNPs. Samples were collected at GW 7–20.
Following data analysis, 108–174 target SNPs were classified as effective SNPs and
included in the paternity calculations of 15 family trio cases, generating paternity
probabilities of greater than 99.9999%. All paternity results were confirmed by CE-STR
analysis using either amniotic fluid collected at GW 16–19 or buccal swabs collected from
the newborns postpartum. The high specificity of the methodology was validated by
successful paternity discrimination between biological fathers and their siblings and by large
separations between the CPIs calculated for the biological fathers and those for 60
unrelated men.
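
The relation between a CPI and a probability of paternity follows from Bayes' theorem and can be written in a few lines of Python. The sketch assumes the conventional neutral prior probability of 0.5 unless another prior is supplied; the function name is illustrative.

```python
def probability_of_paternity(cpi: float, prior: float = 0.5) -> float:
    """Posterior probability of paternity from a combined paternity index.

    With the conventional prior of 0.5 this reduces to CPI / (CPI + 1).
    """
    return cpi * prior / (cpi * prior + (1.0 - prior))


# A probability above 99.9999% requires a CPI of about one million or more.
for cpi in (1_000, 999_999, 10_000_000):
    print(cpi, f"{probability_of_paternity(cpi):.7%}")
```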

NIPPT by MPS requires sophisticated bioinformatics to analyze large quantities of
genotyping data. There is still no consensus among scientists as to which biostatistical
approach should be used to ensure that the paternity test returns a true result. One point
of view is to use the PE or PI to estimate the accuracy of an NIPPT (Drabek & Cereda,
2014; Jiang et al., 2016). The other opinion is that standard approaches based on PI are
not adequate to estimate paternity reliably. According to this school of thought,
conventional paternity testing methodologies rely on the availability of pure child DNA;
however, the ccffDNA found in maternal plasma cannot be separated from the much larger
amount of maternal cell-free DNA, thus precluding the use of traditional methods to
analyze NIPPT data (Ryan et al., 2014). For this purpose, special bioinformatic
algorithms that take the fetal fraction into account have been proposed for
correctly identifying fetal-specific alleles and calculating the diagnostic potency of an
NIPPT test (Ryan et al., 2013).

Natera Inc. (USA) was the first company to successfully commercialize the NIPPT 
service for the general public based on the analysis of more than 300,000 SNPs on a 
Human-Cyto12-SNP microarray (Ryan et al., 2013). Out of the 36,400 tests using an 
unrelated male as the AF, 99.95% correctly excluded paternity, and 0.05% were
indeterminate. DNA Diagnostic Center developed the Certainty NIPPT, based on MPS analysis of
2688 SNPs (DNA Diagnostic Center, n.d.), with maternal blood collection possible from
GW 10 onward. Currently, this is the only NIPPT approved by the AABB. A number of other
genetic testing providers also offer NIPPT as a service. 

In contrast to the many commercially available STR multiplexes for CE-based
genotyping, only one MPS-based panel described by its manufacturer as specifically
designed for NIPPT, the MGIEasy Pa-SNPs Genotyping Kit (MGI Tech Co., China),
was commercially available at the time of writing. With the increasing number of
paternity tests performed worldwide, there is little doubt that more MPS panels purposely
developed for NIPPT will soon appear on the market.


Kinship testing 


As mentioned above, kinship testing is based on evaluating the probabilities that
IBS alleles between two or more individuals are IBD as opposed to being identical by 
chance. The more distant the relationship between the individuals, the more difficult 
it is to distinguish whether the relationship is true or whether the individuals are unre- 
lated. The increase in efficiency of kinship identification can be achieved by expanding 
the number of genetic markers, testing additional family members, and increasing the DP 
of the markers analyzed. 
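
The reasoning in terms of IBD can be made explicit with a single-locus likelihood ratio. For a pair of individuals, a claimed relationship is described by the probabilities k0, k1, and k2 of sharing zero, one, or two alleles IBD, and the LR against unrelatedness is a weighted sum of the genotype probabilities under each IBD state. The Python sketch below is a minimal illustration that assumes Hardy-Weinberg proportions, no mutation, no linkage, and no subpopulation correction; the allele labels and frequencies are hypothetical.

```python
def genotype_prob(g, freq):
    """Population probability of an unordered genotype under Hardy-Weinberg."""
    a, b = g
    return freq[a] ** 2 if a == b else 2 * freq[a] * freq[b]


def prob_given_shared(g, shared, freq):
    """P(genotype g | one allele is an IBD copy of `shared`, other is random)."""
    a, b = g
    if a == b:
        return freq[a] if shared == a else 0.0
    if shared == a:
        return freq[b]
    if shared == b:
        return freq[a]
    return 0.0


def kinship_lr(g1, g2, freq, k):
    """Single-locus LR for a relationship with IBD coefficients k = (k0, k1, k2)."""
    k0, k1, k2 = k
    p_unrelated = genotype_prob(g2, freq)
    p_ibd1 = 0.5 * (prob_given_shared(g2, g1[0], freq)
                    + prob_given_shared(g2, g1[1], freq))
    p_ibd2 = 1.0 if sorted(g1) == sorted(g2) else 0.0
    return k0 + (k1 * p_ibd1 + k2 * p_ibd2) / p_unrelated


# IBD coefficients for common pairwise relationships.
PARENT_CHILD = (0.0, 1.0, 0.0)
FULL_SIBS = (0.25, 0.5, 0.25)
SECOND_DEGREE = (0.5, 0.5, 0.0)   # half sibs, grandparent, uncle/aunt
FIRST_COUSINS = (0.75, 0.25, 0.0)

freq = {"a": 0.1, "b": 0.2, "c": 0.7}   # hypothetical allele frequencies
pair = (("a", "a"), ("a", "a"))
for name, k in [("full sibs", FULL_SIBS), ("second degree", SECOND_DEGREE),
                ("first cousins", FIRST_COUSINS)]:
    print(name, round(kinship_lr(*pair, freq, k), 3))
```

For the same shared rare genotype the LR falls as the relationship becomes more distant, which is the numerical counterpart of the difficulty described above.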

Although modern STR multiplexes for CE analysis contain 25–30 loci, they are
often inadequate to determine kinship between first-degree relatives. Establishing kinship 
of the second degree is even more problematic. The advent of MPS makes it possible to 
extend the scope of kinship testing to more distant and complex relationships. 

Analysis of second-degree kinship by a custom MPS STR-only panel showed 
improved results when compared to CE-typing of the same loci (Liu et al., 2020). 
The panel contained 42 autosomal STR loci from three commercially available kits:
the Goldeneye 20A System (PeopleSpot Inc., China), the Goldeneye 22NC System
(PeopleSpot Inc., China), and the Microreader 23sp System (Beijing Microread Genetics
Co., Ltd., China). An increase in DP for 40 out of the 42 loci studied was observed.
Family simulations and likelihood ratio (LR) calculations showed that this STR panel and
the ForenSeq DNA Signature Prep Kit had similar effectiveness for second-degree kinship
identification.

Combining STR and SNP loci into a single sequencing panel increases the DP of 
kinship analysis. Li et al. (2019) evaluated the efficacy of the ForenSeq DNA Signature 
Prep Kit (Illumina, USA) for pairwise kinship analysis by assessing 67 individuals from 
two families. Six types of pairwise relationships, including 60 full siblings, 48 
grandparent-grandchildren, 147 uncle/aunt-nephew/nieces, 97 first cousins, and 190 
nonrelatives were generated from these two families, and the corresponding LR was 
calculated using either sequence-based or length-based STR genotyping data. The results 
showed that 54, 9, and 5 additional alleles were observed after sequencing 27 autosomal 
STRs, 24 Y-STRs, and 7 X-STRs, respectively, compared to those obtained from
length-based information. Additionally, 11 novel alleles were identified, and 51 genotypes
regarded as homozygous based on length information were found to be isoallelic
heterozygotes based on sequence information. The full sibling index, grandparent-grandchild
index, uncle/aunt-nephew/niece index, and first cousin index obtained by sequence-based
typing were generally higher than those obtained by length-based STR analysis.
When STRs and SNPs in the ForenSeq DNA Signature Prep Kit panel were combined,
the increase in the kinship statistics was even more pronounced. Another study using the
same kit (Xu et al., 2019) and studies using SNP-only MPS panels (Grandell et al., 2016; 
Mo et al., 2018) confirm these observations. 

Besides STRs and SNPs, MH loci also show promise in resolving kinship cases (Qu 
et al., 2020; Staadig & Tillmar, 2021). Staadig and Tillmar (2021) examined a custom- 
made QIAseq MH panel (Qiagen, USA) containing 45 different MHs by genotyping 
75 individuals. The results showed that full-sibling cases could clearly be solved; howev- 
er, when simulating a half-sibling versus unrelated case scenario, there was some overlap 
of the LR distributions, potentially resulting in inconclusiveness. The authors suggest that 
the inclusion of more markers in the panel would positively resolve this issue. Results of 
another study using a custom-designed panel of 60 MHs and 73 individuals (Qu et al., 
2020) confirmed the usefulness of MPS-based analysis of MH loci for second-degree 
kinship inference while also noticing the need for more markers to provide more DP. 


Conclusions 


Since it first appeared, NGS has made a deep impact on the fields of forensic identification
and kinship testing. By making it possible to obtain both length-based and sequence-based
information, and because of its high throughput, MPS technology provides strong
advantages over traditional CE STR genotyping.

The sequence-based information about STR loci and their flanking regions permits
the identification of new alleles and the determination of their origin. The increased
allelic diversity improves the DP of MPS analysis because the frequency of each allele
decreases as new alleles are discovered at a locus.
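
The effect of newly resolved alleles on DP can be shown numerically. The Python sketch below computes the discrimination power of one locus as one minus the sum of squared genotype frequencies under Hardy-Weinberg proportions, first for a hypothetical locus with three length-based alleles and then for the same locus after sequencing splits the most common allele into two isoalleles. The frequencies are invented for illustration.

```python
def power_of_discrimination(allele_freqs):
    """DP of one locus: 1 minus the sum of squared genotype frequencies.

    Genotype frequencies are derived from the allele frequencies assuming
    Hardy-Weinberg proportions.
    """
    n = len(allele_freqs)
    sum_sq = 0.0
    for i in range(n):
        for j in range(i, n):
            if i == j:
                genotype = allele_freqs[i] ** 2
            else:
                genotype = 2 * allele_freqs[i] * allele_freqs[j]
            sum_sq += genotype ** 2
    return 1.0 - sum_sq


length_based = [0.4, 0.3, 0.3]          # three alleles distinguished by CE
sequence_based = [0.2, 0.2, 0.3, 0.3]   # the 0.4 allele split into two isoalleles

print(f"DP, length-based:   {power_of_discrimination(length_based):.4f}")
print(f"DP, sequence-based: {power_of_discrimination(sequence_based):.4f}")
```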

The MPS technology makes it easy to integrate a very large number of genetic 
markers into a single reaction. It has now become possible to create a genotyping panel 
combining autosomal and gonosomal STR, SNP, and MH loci, as well as mtDNA data. 
Many NGS panels are already commercially available for various purposes (Table 12.1), 
with the total number of loci in some of them exceeding 1000. With the increase in the 
number of genetic markers it might soon be possible to reliably resolve third-degree or 
even more distant kinship cases. Advances in bioinformatics and the development of ge- 
netic databases make it straightforward for users to design bespoke MPS panels of any 
complexity for a wide range of purposes. 

All this makes MPS-based genotyping a powerful tool for parentage and kinship 
determination. At the same time, there are a number of important issues that need to 
be addressed prior to the wide adoption of NGS technology. Analysis of the large volumes of
genotyping data generated by MPS requires robust data storage and reliable bioinformatic
support. The nomenclature for NGS-based STR alleles needs to be standardized, and a
convention for reporting isoalleles needs to be developed. Many of the markers
included in large genotyping panels are located close to each other within the same
linkage group. Genetic linkage therefore needs to be considered when developing statistical
approaches for calculating parentage and kinship statistics from MPS genotyping results.

Some of these issues have already been satisfactorily resolved, and most of them, if not 
all, will be in the very near future. With the decrease in analysis cost, it is just a matter of 
short time before MPS genotyping completely replaces the current CE-based technology 
and becomes a method of choice for parentage and kinship analysis. 
Introduction


Forensic genetic analysis is a powerful tool for identifying criminals from
biological evidence such as blood, semen, saliva, hair, bone, teeth, and
muscle tissue. The first method of routine human identity testing, restriction fragment
length polymorphism (RFLP) analysis, was developed in the mid-1980s in the United States
(Budowle and Baechtel, 1990; Jeffreys et al., 1985; Wyman & White, 1980).
Typical VNTR loci with a high power of discrimination were typed successfully using
RFLP analysis, which required at least 10–25 ng of DNA (Giusti &
Budowle, 1995). Because of this large DNA template requirement, the
RFLP technique was soon replaced with polymerase chain reaction (PCR)-dependent
methods for typing genetic markers (Saiki et al., 1985). Over the last few decades,
very small amounts of DNA, down to the picogram level, have been routinely typed with
PCR in forensic casework.

The initial PCR-based genetic marker systems relied on single
nucleotide polymorphism (SNP) variations. The HLA-DQA1 and PolyMarker loci,
which show sequence-based polymorphism, were analyzed using allele-specific
oligonucleotide (ASO) hybridization probes in a reverse dot blot format
(Budowle et al., 1995; Comey and Budowle, 1991; Gyllensten and Erlich, 1988; Walsh
et al., 1991). Although these SNP-based systems provided highly sensitive detection,
they fell out of favor because of their limited power of discrimination. This led to the
shift from early PCR-based SNP genetic markers to short tandem repeat (STR) markers,
or microsatellites, which have now become an essential part of forensic DNA genotyping.
Multiplexing capability, small product size, polymorphic nature, and easy amplification
by PCR are the key attributes of an STR marker (Budowle et al., 2001). Consequently,
the currently employed multiplex autosomal STR loci for DNA typing of forensic samples
offer high sensitivity, specificity, and power of discrimination with the ability to type
low DNA template samples. However, in cases of highly degraded biological samples
with limited DNA, SNPs have proven to be better than STR typing. Thus, for
generating profiles from degraded samples and for the higher yield of genetic information
from forensically compromised samples, SNPs serve as a potential alternative to STRs
(Budowle et al., 2004).


Characterization and origin of SNPs 


The haploid human genome, consisting of approximately 3.2 billion nucleotides, exhibits
both length- and sequence-based genetic variations (Batnaym et al., 2013). Length-based
genetic variations, represented by STRs, are made up of tandemly repeated sequences,
each 2–7 base pairs in length. Sequence variants, represented by SNPs, are variations
at a single base arising from point mutations at definite locations in the genome.
Base substitutions, either transitions or transversions, give rise to two or
more alternative bases at a given position. Genetic variability at a base position is
treated as an SNP only when at least two alleles have frequencies greater than
1% in a population. The number of SNPs in the human genome is estimated at 10–11 million,
which corresponds to about one SNP per 275 bp (Sachidanandam
et al., 2001; Venter et al., 2001). Although multiallelic SNPs do exist, the majority
of SNPs are biallelic because mutation rarely recurs at the same
nucleotide, and it is thus highly unlikely that two independent base changes occur at a
single position over time. The rate of single nucleotide substitution in
the mammalian genome is estimated to be between 1 × 10⁻⁹ and 5 × 10⁻⁹ per
nucleotide per year (Martinez-Arias et al., 2001). Since 85% of human genetic variation
is due to SNPs, they are ubiquitous and the most common class of human polymorphism
(Holden, 2002). Some SNPs are restricted to one major population and
are termed rare variants or single nucleotide variants. However, most
SNPs are polymorphic in all human populations and are called true SNPs.
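
The two numerical statements above, the SNP density and the 1% frequency criterion, can be
checked with a few lines of arithmetic. The sketch below is only an illustration: the
allele counts are invented, and `is_snp` is a hypothetical helper name rather than a
function from any existing library.

```python
GENOME_SIZE_BP = 3_200_000_000  # haploid genome size quoted in the text
BP_PER_SNP = 275                # one SNP per 275 bp, as quoted in the text

# Expected number of SNPs implied by the density figure.
expected_snps = GENOME_SIZE_BP / BP_PER_SNP
print(f"Expected SNPs: {expected_snps / 1e6:.1f} million")


def is_snp(allele_counts, threshold=0.01):
    """Return True if at least two alleles exceed the frequency threshold.

    allele_counts maps each observed base to the number of chromosomes
    carrying it in the (hypothetical) population sample.
    """
    total = sum(allele_counts.values())
    common = [base for base, n in allele_counts.items() if n / total > threshold]
    return len(common) >= 2


# Invented counts from 1000 sampled chromosomes.
print(is_snp({"A": 950, "G": 50}))  # minor allele at 5%
print(is_snp({"A": 995, "G": 5}))   # minor allele at 0.5%
```

The density figure implies roughly 11.6 million SNPs, which is of the same order as the
10–11 million quoted; the second site in the example would be treated as a rare variant
rather than an SNP.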

Depending upon origin, SNPs can be found in both coding and noncoding sites of
the genome. When they are found in the coding sites of genes, they can cause variation
in the structure or function of the encoded protein and become an important tool in the
diagnosis of genetic disorders. Such an SNP that affects a protein is termed a coding SNP
(cSNP) or an exonic SNP. However, if they are found in the noncoding sites of the
genome, the phenotype of an individual is usually not affected; rather, these SNPs are
used in evolutionary studies, population genetics, and forensic analysis (Gill, 2001;
Zhao et al., 2003). The application of SNP markers is well established in biological
science, as whole human genome sequencing resulted in the discovery of millions of SNPs,
which are being applied in various areas of forensic science, e.g., phenotype prediction,
testing of mitochondrial DNA (mtDNA), Y-SNPs as lineage markers, ancestry informative
markers, and other probable forensic casework applications. SNPs hold potential for
developing an ideal typing method with a level of precision and sensitivity
adequate for the analysis of degraded DNA, along with good multiplexing capacity.


Forensic relevance of SNP analysis in next-generation sequencing 


Conventional SNP genotyping assays 


Over the last few years, multiple SNP analysis techniques have been developed. The
conventional methods of SNP typing are based on four major molecular mechanisms,
i.e., allele-specific hybridization, primer extension, oligonucleotide ligation, and
invasive cleavage. A summary of these four mechanisms for allelic
discrimination, along with the detection methods used to analyze their products, is given
in Table 13.1. However, an ideal SNP genotyping technique that meets all research
requirements, particularly for forensic purposes where only a small amount of DNA is
available for casework, is still lacking. Some applications work with a small quantity of
DNA but require a large number of SNPs, whereas others use few SNPs but a large quantity
of DNA. The number of SNPs and the amount of input DNA required therefore result in
varying levels of throughput depending upon the application.

Moreover, the requirement for PCR amplification in almost all conventional methods
limits their use in forensic applications where multiplexing is needed. Techniques
like the Invader assay analyze SNP markers without a PCR step, but the
requirement for a large amount of DNA makes them unsuitable for forensic analysis.
Therefore, conventional SNP genotyping techniques with limited multiplexing capability
are not suitable for routine forensic DNA analysis. Another limitation lies in the
analysis of mixtures, owing to the biallelic nature of the majority of SNP
markers. However, quantitative techniques like mass spectrometry and pyrosequencing
have enabled the determination of contributors in a mixed profile. The variety of
detection platforms available for SNP typing, with no single gold standard method
adopted, has prevented its routine use
(Budowle, 2004; Sobrino et al., 2005). Hence, when choosing a technique for SNP
typing, the parameters that need to be considered are accuracy, cost, sensitivity,
flexibility, and time consumption.


Benefits of next generation sequencing technology over 
conventional genotyping assays 


Overview 


The determination of the order of nucleotides in a nucleic acid polymer is the major goal
of sequencing. Sequencing first became practical with the invention of the Sanger
sequencing method in 1977, in which, following the principle of sequencing by synthesis
(SBS), a ddNTP is incorporated into a growing DNA chain, thereby
terminating further extension (Sanger et al., 1977). With the subsequent introduction of
fluorescently labeled ddNTPs and capillary electrophoresis (CE) platforms, the
sequencing of complete genomes became possible (Mitchelson, 2003). Over the last
decade, advanced sequencing technologies have surpassed Sanger-based sequencing in

Table 13.1 Summary of major SNP genotyping assays along with principle and detection method for each allelic discrimination reaction.

Columns: S. no.; SNP typing assay method; Principle; Detection method; Methodology; Advantages; Disadvantages; References.

1. Allele-specific hybridization (allele-specific oligonucleotide, ASO, hybridization)
   Principle: Using ASO probes, DNA targets varying at one nucleotide are discriminated by hybridization.

   Detection method 1. Fluorescence resonance energy transfer (FRET) (homogeneous hybridization)
      Methodology: Two PCR primers along with fluorescently labelled ASO probes. An increased amount of DNA corresponds to increased fluorescence.
      Advantages: Highly sensitive. High-throughput genotyping. Single-tube reaction for PCR and detection.
      Disadvantages: Limited multiplexing capability.
      References: Wallace et al. (1979), Clegg (1992)

   Detection method 1.1 Using LightCycler (Roche), which measures the light intensity
      Methodology: One probe labelled with fluorescein at its 3′ end and another with an LC Red 640 or 705 label at its 5′ end are hybridized on the DNA target.
      References: Lareu et al. (2001)

   Detection method 1.2 Using TaqMan assay (Applied Biosystems) (5′ nuclease activity of Taq polymerase displaces and cleaves probes hybridized to the target)
      Methodology: Two TaqMan probes with fluorescent dyes at the 5′ end and a quencher at the 3′ end hybridize to the target DNA, with cleavage of the 5′ dye by the 5′ nuclease activity of Taq polymerase.
      References: Holland et al. (1991)

   Detection method 1.3 Using molecular beacons (oligonucleotide probes with two complementary sequences flanking the sequence complementary to the target, with a fluorophore at the 5′ end and a quencher at the 3′ end)
      Methodology: Two molecular beacons with different fluorophores are used; when a beacon binds to the target DNA, the quencher and the fluorophore are separated, and fluorescence appears.
      References: Tyagi and Kramer (1996)

   Detection method 2. Microarrays (array hybridization), e.g., GeneChip systems (Affymetrix), which use tens of ASO probes for each SNP
      Methodology: Short oligonucleotides bound to solid supports are hybridized with fluorescently labelled PCR products.
      Advantages: GeneChip is designed for genotyping a large number of SNPs.
      Disadvantages: Laborious to create ideal conditions for multiplexing.
      References: Mei et al. (2000)

2. Primer extension
   Principle: Based on the capability of DNA polymerase to introduce specific dNTPs complementary to the template DNA sequence. Three types of reactions exist.
   References: Kuppuswamy et al. (1991)

   Reaction 1. Minisequencing reaction or single nucleotide primer extension. DNA polymerase performs extension of a primer that anneals to its target DNA immediately adjacent to the SNP.

      Detection method 1.1 SNaPshot (Applied Biosystems), which uses electrophoresis
         Methodology: An unlabelled primer with its 3′ end immediately upstream of the SNP is extended with one fluorescently labelled ddNTP.
         Advantages: Multiplex reactions can be done by structural separation of the minisequencing products.
         Disadvantages: Difficult to differentiate between heterozygous and homozygous genotypes.
         References: Quintans et al. (2004), Sanchez et al. (2005)

      Detection method 1.2 MALDI-TOF MS (matrix-assisted laser desorption ionization time-of-flight mass spectrometry), e.g., MassEXTEND (Sequenom), PinPoint assay (Applied Biosystems)
         Methodology: Determines the molecular weight of a DNA product by calculating the time of flight.
         Advantages: Resolution is very high. Direct method of detection.
         Disadvantages: Pure sample required.
         References: Haff & Smirnov (1997), Braun et al. (1997)

      Detection method 1.3 Microarrays (fluorescence)
         Methodology: Performed on the surface of the chip or in solution for genotyping SNPs with minisequencing primers.
         Advantages: An array-of-arrays format allows analysis of many samples per slide.
         Disadvantages: Difficult to design.
         References: Pastinen et al. (2000)

      Detection method 1.4 Fluorescence polarization (FP)
         Methodology: The primer is extended by the allele-specific dye-terminator on the template, thereby increasing the molecular weight of the fluorophore.
         Advantages: Used in detecting allelic discrimination products where the initial molecules have different sizes.
         References: Chen et al. (1999)

   Reaction 2. Allele-specific extension reaction, which works on the difference in extension efficiency of DNA polymerase between primers with matched and mismatched 3′ ends.

      Detection method 2.1 Microarrays (fluorescence)
         Methodology: Two primers are needed, one for each SNP allele. The primer forming the product is detected using fluorescently labelled nucleotides on a microarray.
         References: Pastinen et al. (2000)

   Reaction 3. Pyrosequencing reaction, based on the sequencing by synthesis method.

      Detection method 3.1 Luminescence (detection based on the pyrophosphate released during the DNA polymerase reaction)
         Methodology: Light is produced by an enzyme cascade system (four enzymes and a specific substrate) whenever a nucleotide is complementary to the template.
         Advantages: Allows rapid real-time determination of 20–30 bp of target DNA.
         Disadvantages: Template preparation is required. Limited multiplexing capability.
         References: Ronaghi et al. (1998)

3. Allele-specific oligonucleotide (ASO) ligation
   Principle: Based on the capability of DNA ligase to join two oligonucleotides when they hybridize adjacent to one another on a DNA template. Two types of reactions exist.
   References: Landegren et al. (1988), Barany (1991)

   Reaction 1. Ligase chain reaction. Three probes are used: one common probe anneals to the target DNA, and the other two ASO probes compete to anneal to the DNA target, resulting in a 'nick' at the allelic site. DNA ligase ligates the annealed allelic probe to the common probe. Ligation products are amplified using PCR.

      Detection method 1.1 Fluorescence
         Methodology: Biotin on the common probe and a reporter group on the allele-specific probe can be used to detect the ligated product.
         References: Landegren et al. (1988)

   Reaction 2. Sequence-coded separation reaction

      Detection method 2.1 Electrophoresis
         Methodology: Mobility modifiers and fluorescent dye-labelled allele-specific probes enable multiplexed OLA, where ligation products are detected using a fluorescence DNA sequencer.
         References: Grossman et al. (1994)

      Detection method 2.2 FRET (with the ligation, an increase in FRET is detected)
         Methodology: Three dye-labelled ligation oligonucleotides, where the donor dye labels the 5′ end of the common probe while the acceptor dye labels the 3′ end of the ASO probes.
         Advantages: All the reagents are added simultaneously, reducing the manual work and allowing easy automation.
         References: Chen et al. (1998)

4. Invasive cleavage/Invader assay
   Principle: Based on recognition and cleavage by a flap endonuclease when two oligonucleotides, the invader oligonucleotide and the probe, anneal to the target DNA such that the probe overlaps the 3′ end of the invader oligonucleotide, forming a 3D structure that is cleaved by the flap endonuclease, releasing the 5′ arm of the probe.

   Detection method 4.1 Fluorescence (reporter dye and quencher)
      Methodology: A reporter dye on the 5′ arm and a quencher complementary to the probe. With the cleavage, the fluorophore is removed, and fluorescence is enhanced.
      Advantages: No requirement for previous PCR amplification.
      Disadvantages: Target DNA is needed in large amounts.
      References: Lyamichev et al. (1999), Kaiser et al. (1999), Mein et al. (2000)

   Detection method 4.2 Fluorescence polarization (FP)
      Methodology: The molecular weight of the fluorophore-labelled probe decreases on cleavage, with a decrease in FP.
      Advantages: Increase in sensitivity with PCR before the invader reaction.
      References: Hsu et al. (2001)

   Detection method 4.3 Mass spectrometry
      Methodology: Different numbers of nucleotides are used to design probes.
      References: Griffin et al. (1999)

terms of output and overall expense; these technologies are known as next-generation
sequencing (NGS) or, alternatively, massively parallel sequencing (MPS) (Kircher
& Kelso, 2010).

NGS is defined as a non-Sanger-based technology that allows the parallel sequencing
of millions or billions of DNA molecules, resulting in substantially higher throughput
with minimal need for the fragment cloning commonly employed in Sanger
sequencing (Sobiah et al., 2018). This massively parallel sequencing is achieved by
miniaturizing each sequencing reaction, which reduces the size of the instrument along
with the reagent cost per reaction (Qin, 2019). It includes second- and third-generation
sequencing platforms, in which miniaturization first allowed the simultaneous analysis
of a large number of samples and then the determination of the sequence
of single DNA molecules (Sobiah et al., 2018). A significant trait of most NGS platforms
is the limited read length, in the range of hundreds of base pairs, which requires the
material to be fragmented prior to analysis if the DNA to be sequenced is longer than the
achievable read length. At the end of sequencing, the reads are reassembled in silico to
obtain the whole sequence of the target molecule (Qin, 2019). Second-generation
sequencing technology has the drawback of relatively short read lengths, leading
to complexities in splicing, assembly, and annotation (van Dijk et al., 2014), whereas
third-generation sequencing technology has the advantage of detecting single molecules
with real-time sequencing.


Current next-generation sequencing platforms 


NGS allows the concurrent examination of forensically admissible genetic
markers, including SNPs, mutations, STRs, and transcripts, through massively parallel
sequencing (Sobiah et al., 2018). It has been implemented mainly in human
identification and phenotypic trait determination. The sequencing platforms
introduced to date are Roche (454) sequencing, Illumina sequencing, Applied Biosystems
(SOLiD) sequencing, Ion Torrent (Thermo Fisher Scientific), Helicos BioSciences
Corporation (HeliScope) sequencing, Pacific Biosciences SMRT sequencing, Oxford
Nanopore, Qiagen, and BGI (Goodwin et al., 2016).

Two dedicated systems among the MPS platforms that are used in various forensic
laboratories are the MiSeq FGx (Illumina) and the Ion GeneStudio S5 system (Thermo Fisher
Scientific). The former is one of the most widely used MPS systems for forensic casework,
together with the ForenSeq DNA Signature panel. The kit contains more than 200
forensically relevant genetic markers, consisting of 27 autosomal STRs, 24 Y-STRs,
7 X-STRs, 94 identity-informative SNPs (iSNPs), 56 ancestry-informative SNPs (aSNPs),
24 phenotype-informative SNPs (pSNPs), and amelogenin (Jager et al., 2017). Ion Torrent's
GeneStudio S5 system is also widely accepted for forensic casework using the Precision
ID Identity Panel kit, which contains 90 autosomal SNPs and 34 Y upper-clade SNPs
in a single, efficient workflow (Dash et al., 2022).
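
As a quick arithmetic check of the marker counts quoted above, the numbers can be tallied
per marker class. The dictionaries below simply restate the figures from the text; the
variable names are illustrative.

```python
# Marker counts for the ForenSeq DNA Signature panel, as quoted in the text.
forenseq_markers = {
    "autosomal STRs": 27,
    "Y-STRs": 24,
    "X-STRs": 7,
    "iSNPs": 94,
    "aSNPs": 56,
    "pSNPs": 24,
    "amelogenin": 1,
}

# Marker counts for the Precision ID Identity Panel, as quoted in the text.
precision_id_markers = {
    "autosomal SNPs": 90,
    "Y upper-clade SNPs": 34,
}

forenseq_total = sum(forenseq_markers.values())
forenseq_strs = sum(n for name, n in forenseq_markers.items() if "STR" in name)
forenseq_snps = sum(n for name, n in forenseq_markers.items() if "SNP" in name)
precision_id_total = sum(precision_id_markers.values())

print(f"ForenSeq: {forenseq_total} markers "
      f"({forenseq_strs} STRs, {forenseq_snps} SNPs, 1 amelogenin)")
print(f"Precision ID Identity Panel: {precision_id_total} SNPs")
```

The tally gives 233 markers for the first panel, consistent with "more than 200", and 124
SNPs for the second.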
All these platforms are based on the principle of SBS technology and allow forensic
scientists all over the world to utilize the maximum potential of NGS. Table 13.2
compares the current NGS platforms in terms of read length, number of reads, run time,
advantages, and disadvantages.

The major steps of sequencing are similar for all these platforms: construction and
amplification of DNA templates; distribution of templates on a solid support; sequencing
and imaging; base calling and quality control; and data analysis. In regard to operation,
NGS takes two major approaches: de novo sequencing and resequencing. In the first,
the genome of an organism is sequenced for the first time, while in the second, the
genome of a species is sequenced when a reference sequence is already available. The
sequencing strategy and the data analysis are selected on this basis. Resequencing
is followed in human forensics and population genetics studies, while in
microbial forensics, both de novo sequencing and resequencing of
microbial genomes are followed.
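
The idea behind resequencing, comparing reads against an available reference, can be
sketched in a few lines. The example below is a deliberately simplified illustration and
not a production variant caller: the reference, the reads, and their alignment start
positions are invented, the reads are assumed to be already aligned without gaps, and the
`min_depth` and `min_fraction` thresholds are arbitrary assumptions.

```python
from collections import Counter, defaultdict


def call_variants(reference, aligned_reads, min_depth=3, min_fraction=0.2):
    """Report positions where reads support a non-reference base.

    aligned_reads is a list of (start_position, read_sequence) pairs,
    assumed to be gap-free alignments against the reference.
    """
    # Build a pileup: base counts observed at every covered position.
    pileup = defaultdict(Counter)
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pileup[start + offset][base] += 1

    variants = {}
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to make a call
        has_alt = any(
            base != reference[pos] and n / depth >= min_fraction
            for base, n in counts.items()
        )
        if has_alt:
            variants[pos] = (reference[pos], dict(counts))
    return variants


# Invented 10-base reference and five short reads; three reads carry a T
# instead of the reference C at position 5 (0-based).
reference = "ACGTACGTAC"
reads = [
    (0, "ACGTAC"),
    (2, "GTATGT"),
    (3, "TATGTA"),
    (4, "ATGTAC"),
    (4, "ACGTAC"),
]

variants = call_variants(reference, reads)
print(variants)
```

Only position 5 is reported, with both the reference base C and the alternative base T
observed, as expected for a heterozygous SNP. De novo sequencing has no such reference to
compare against and must instead assemble the reads from their mutual overlaps.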


NGS versus traditional capillary electrophoresis-based assays 


Although CE-based assays involving SNP and STR genotyping have been the gold
standard for DNA typing of forensic samples over the past 20 years, they have several
significant disadvantages.

1. The foremost drawback of CE-based assays is their limited ability to multiplex.
Detecting more STR markers simultaneously becomes complex because of overlap and size
constraints for fluorescently labeled amplicons (Oostdik et al., 2014; Wang et al.,
2015). Owing to this limited multiplexing, samples are processed individually,
which makes the analysis tedious and error-prone.

2. Another limitation is the low-resolution genotyping of current markers. Traditional
CE-based analysis relies on length-based typing, so alleles that contain substitution
variants cannot be captured reliably, and sequence differences between
alleles of the same length cannot be identified. This reduces the resolution
of the technology, particularly for mixed DNA samples, which are difficult to analyze.
Such compromised samples yield low detection rates, leading to a
loss of useful genomic information. Moreover, traditional CE-based assays cannot
determine mutations in complicated paternity cases.
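
The second limitation can be made concrete with a toy example. The sketch below uses two
invented repeat regions of a hypothetical tetranucleotide STR: a length-based call, as in
CE, assigns both the same allele number, whereas a sequence-based call keeps them apart.
The function names and sequences are assumptions made for illustration only.

```python
def length_based_call(repeat_region, repeat_unit_len=4):
    """CE-style call: the allele is the number of repeat units by length."""
    return len(repeat_region) // repeat_unit_len


def sequence_based_call(repeat_region, repeat_unit_len=4):
    """Sequencing-style call: allele number plus the full sequence."""
    return (len(repeat_region) // repeat_unit_len, repeat_region)


# Two hypothetical isoalleles: same length, one base substitution (A -> G)
# in the fifth repeat unit of the second allele.
allele_a = "TCTA" * 10
allele_b = "TCTA" * 4 + "TCTG" + "TCTA" * 5

same_by_length = length_based_call(allele_a) == length_based_call(allele_b)
same_by_sequence = sequence_based_call(allele_a) == sequence_based_call(allele_b)

print("Length-based calls:", length_based_call(allele_a), length_based_call(allele_b))
print("Indistinguishable by length:  ", same_by_length)
print("Indistinguishable by sequence:", same_by_sequence)
```

Both alleles are called "10" by length, so two contributors carrying them would look
identical in a CE profile, while their sequences remain distinguishable.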


Advantages of NGS for STR and SNP genotyping 


All the novel sequencing platforms share four main refinements
over the traditional technologies. Firstly, these platforms rely on the construction of
NGS libraries in a cell-free system, avoiding bacterial cloning of DNA fragments.
Secondly, thousands to millions of sequencing reactions can be run in parallel with NGS.
Also, multiple polymorphisms can be analyzed efficiently, which was earlier a limitation


Table 13.2 Comparison of characteristics of current NGS platforms.

Columns: Manufacturer; Instrument; Chemistry; Read length (nucleotides); No. of reads; Output (Gb); Runtime; Advantages; Disadvantages; Applications.

Illumina (established in 1998, San Diego, CA)
   MiSeq FGx: read length 2 × 300 bp; 25 million reads; output 15 Gb; runtime 4–55 h
   NextSeq 550 series: read length 2 × 150 bp; 400 million reads; output 120 Gb; runtime 12–30 h
   NextSeq 1000 and 2000 series: read length 2 × 150 bp; 1.1 billion reads; output 330 Gb; runtime 11–48 h
   Advantages: Short run time and ease of use. High yield.
   Disadvantages: Expensive per base.
   Applications: Long-read applications. Exome and transcriptome sequencing. miRNA and sRNA analysis.

Ion Torrent (launched in 2010 by Life Technologies Corp., now part of Thermo Fisher Scientific)
   Ion GeneStudio S5 system: read length 200–600 bp; 2–80 million reads; output 15 Gb
   Ion GeneStudio S5 Plus system: read length 200–600 bp; 2–130 million reads
   Ion PGM 316 chip: read length 200–400 bp; 2–3 million reads; output 300 Mb–1 Gb
   Ion Torrent Genexus: read length 200 bp; 12–15 million reads
   Advantages: Efficient, scalable, and low-cost targeted sequencing. Short run time and low reagent cost. Economical for the lowest sample input.
   Disadvantages: Yields fewer reads at a higher cost per Mb.
   Applications: WGS (whole genome sequencing), TGS (target gene sequencing). Library preparation for AmpliSeq, reproducible templating, and chip loading. Regulated lab environment, in vitro diagnostics, sRNA and de novo sequencing, SNP verification. WGS, termed an in-house NGS system.

Roche 454 (first commercial system launched in 2004)
   Roche 454 GS FLX: read length 700 bp; 1 × 10⁶ reads
   Roche 454 GS FLX Titanium system (launched in 2004): read length 750 bp; 1 million reads
   Advantages: Long reads, short run time.
   Disadvantages: Homopolymer errors, expensive.
   Applications: Used in de novo sequencing projects. Metagenomics.

GenapSys (founded in 2010; a company from the Stanford Genome Technology Center)
   Instrument: GenapSys (16 chips)
   Read length: 150 bp; 13 million reads
   Advantages: Low priced, easy to use.
   Disadvantages: Low performance.
   Applications: Identifying pathogens, RNA, sWGS, targeted mRNA, SCP, and gene editing.

Applied Biosystems SOLiD (first SOLiD platform released in 2007)
   Instrument: SOLiD 5500xl
   Chemistry: Sequencing by ligation
   No. of reads: 1.5 × 10⁹; output 180 Gb
   Advantages: Lower cost and inherent error correction.
   Disadvantages: More run time, short reads.
   Applications: Whole genome sequencing.

Pacific Biosciences (launched in 2010)
   Instrument: PacBio (HiFi reads)
   Chemistry: Real-time sequencing
   No. of reads: 1 × 10⁵; output 66.5 Gb
   Advantages: Long reads, short run time.
   Disadvantages: High error rate, low yield.
   Applications: Characterizing bacterial genomes and human disease-related studies.

Nanopore (fourth-generation sequencing)
   Instrument: Nanopore (PromethION)
   Chemistry: Real-time sequencing
   Advantages: Spatial distribution of DNA reads, large sample size (1–48 flow cells).
   Disadvantages: Not evaluated.
   Applications: Identifies biomolecules while they pass through nanoscale holes; personalized medicine.

in CE-based genotyping (Churchill et al., 2016). Thirdly, no electrophoresis is required
for the detection of sequencing output. Lastly, NGS generates an enormous number of
reads at an unprecedented speed, enabling the sequencing of entire genomes. In addition,
as already mentioned, NGS typing is based on SBS chemistry (Bentley et al., 2008),
which reduces errors, incorporation bias, and missed base calls
in homopolymers and repetitive regions (Bragg et al., 2013). A prominent advantage
of NGS is its high sensitivity, which comes from digital data rather than the analog
metrics, such as the shape, size, height, and color of the peak, produced by traditional
CE-based systems. NGS produces quantitative data in the form of countable reads. Moreover,
sensitivity can be increased further by raising the coverage level and the overall
sequencing depth, which helps to detect the small contribution of minor donors in a
complex mixture sample. This outcome cannot be achieved with a CE-based assay.
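
The link between sequencing depth and the detection of a minor contributor can be
illustrated with a simple binomial model. This is a sketch under stated assumptions, not
a validated interpretation model: reads are assumed to be sampled independently, the
minor donor is assumed to be heterozygous for an allele not shared with the major donor
(so that allele receives about half of the minor donor's reads), and the analytical
threshold of 10 reads is an arbitrary illustrative value.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    below = sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below


def minor_detection_probability(depth, minor_fraction, min_reads=10):
    """Chance that a minor-donor-specific allele reaches min_reads.

    depth          total reads at the locus
    minor_fraction share of the DNA from the minor donor (e.g. 0.05 for 1:19)
    min_reads      assumed analytical threshold in reads
    """
    # A heterozygous, unshared allele gets about half of the minor's reads.
    p_read = minor_fraction / 2
    return prob_at_least(min_reads, depth, p_read)


# Minor contributor at 5% (a 1:19 mixture) at increasing depth of coverage.
for depth in (100, 500, 2000):
    prob = minor_detection_probability(depth, 0.05)
    print(f"depth {depth:>5}: P(detect) = {prob:.3f}")
```

Under these assumptions the minor allele is almost never called at a depth of 100 reads
and almost always called at 2000 reads, which is the sense in which deeper coverage
increases sensitivity.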

Although the most accepted STR genotyping technique provides satisfactory
discrimination power for a wide range of forensic applications, NGS-generated STRs
hold many potential advantages in the forensic analysis of challenged samples. Some of
the advantages of NGS for the analysis of STRs are high throughput, concurrent detection
of a large number of STR loci on both autosomes and allosomes, and the capability
of differentiating alleles with similar digital read counts. NGS technology also
facilitates the analysis of mixed DNA samples and complex paternity cases, ultimately
improving casework at a lower cost. STR allele calls generated using NGS are
fully compatible with the current database format, which keeps NGS and CE-based data
comparable. NGS has also become a tool for differentiating between
mutations and exclusions, since more STR loci are detected simultaneously.
Furthermore, NGS technology identifies the sequence of the STR loci, so that variation
within the repeat region can be detected; this may reveal the origin of a
mutation and increases the power of discrimination.

SNP detection benefits distinctly from the application of NGS in the realm of
forensic genomics. NGS offers the capability of differentiating alleles that are similar
in length but differ in sequence. Over the years, forensic DNA phenotyping (FDP)
employing SNP genetic markers has been used extensively in criminal investigations when
STRs are insufficient. These markers are used to infer bio-geographical ancestry (BGA)
and externally visible characteristics (Kayser, 2015). Compared to STR markers, SNPs
have mutation rates about 100,000 times lower and are thus
more stable. SNPs are unlikely to change over time and remain the same throughout an
individual's life (Biesecker et al., 2005). This makes them favorable for
kinship analysis, pedigree or lineage reconstruction cases, and mass disaster victim
identification. SNPs can work effectively with compromised biological samples containing
low-quantity and/or degraded DNA templates. In such compromised samples,
STR analysis may not yield a good result because it requires larger intact fragments
of DNA. SNP analysis can be performed on DNA fragments as short as about 60–80 bp
(Budowle et al., 2004; Divne and Allen, 2005). The study of phenotypic
characteristics such as hair and eye color has also improved with SNP genotyping
through NGS. Since most SNPs are biallelic, each locus is less informative
than an STR locus. However, this limitation can be easily compensated for by typing a
much larger set of SNPs with NGS to achieve the same power of
discrimination afforded by the core STR loci. A set of about 50–100 SNPs is about as
discriminating as a set of 10 STRs (Chakraborty et al., 1999) (Table 13.3).
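
The trade-off between the number of loci and the informativeness per locus can be
explored with a rough calculation. The sketch below assumes independent biallelic SNPs
in Hardy-Weinberg proportions and, purely for illustration, an average per-locus match
probability of 0.1 for an STR; neither value is taken from the cited studies, and the
number of SNPs required depends strongly on the allele frequencies assumed.

```python
import math


def snp_match_probability(p):
    """Per-locus match probability for a biallelic SNP with allele frequency p.

    Sum of squared genotype frequencies under Hardy-Weinberg proportions.
    """
    q = 1.0 - p
    return p ** 4 + (2 * p * q) ** 2 + q ** 4


def snps_needed(target_match_probability, p=0.5):
    """Smallest number of independent SNPs reaching the target match probability."""
    per_locus = snp_match_probability(p)
    return math.ceil(math.log(target_match_probability) / math.log(per_locus))


# Assumed benchmark: 10 STRs, each with a match probability of 0.1.
target = 1e-10

for freq in (0.5, 0.2, 0.1):
    print(f"allele frequency {freq}: "
          f"{snps_needed(target, freq)} SNPs needed")
```

With ideally balanced SNPs (allele frequency 0.5) a few dozen loci suffice for this
benchmark, while less balanced SNPs or a more stringent STR benchmark push the required
number upward, toward the range quoted above.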


Forensic application of SNP markers using NGS technology


Analysis of highly degraded and mixed samples


In routine forensic DNA analysis, sample quantity and quality are major concerns. With
conventional techniques using STR markers, analysts mainly encounter problems
with highly degraded and contaminated DNA recovered from challenging samples,
e.g., highly decomposed human corpses, cremated bodies, ancient
skeletonized remains, charred bones, teeth, and hair shafts exposed to harsh environmental
conditions (Carrasco et al., 2020). However, the emergence of advanced technology in


Table 13.3 Summary of comparison of characteristics of STRs and SNPs.

Characteristics | STRs | SNPs
Mutation rate | 10⁻³ per nucleotide per generation | 10⁻⁸–10⁻⁹ per nucleotide per generation
Analysis of degraded samples | Less suitable | More suitable
Power of discrimination | High | Low; but same level of discrimination can be achieved by multiplexing
Mixture analysis | Poor resolving power | High resolving power due to large number of markers and SNP microhaplotypes
Rare polymorphism | Cannot be studied | Can be reliably detected by analyzing every base of the genome
Mitochondrial DNA analysis | Detects polymorphism only within a hypervariable region | Potential to analyze whole mitochondrial sequence, and heteroplasmy is detected (Holland et al., 2011)
Ethnicity estimation, physical traits, etc. | Not reported | Reported with an accuracy of 90%

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the form of NGS opened new possibilities for human identification from challenging and 
compromised samples. For degraded samples, SNPs are better suited to human identification
than other available markers because they can be typed from short read lengths and
amplicons (<100 bp) (Butler, 2011, pp. 347–362). This lowers the error rate, which is a
major advantage when dealing with degraded samples. New approaches are therefore urgently
needed in forensic casework analysis of degraded samples. In the past few years, various
NGS methods have proven effective with compromised samples (Cho et al., 2020; Turchi et
al., 2020) and with mixed samples containing unequal ratios of DNA donors (Chen et al.,
2020; Hwa et al., 2020; Momota et al., 2021; Petrovick et al., 2020; Ricke et al., 2021).
The human identification SNP multiplex assay, ID 52-plex, was developed using 52 unlinked
biallelic SNPs and used for DNA genotyping in European, Asian, and African populations.
The study reported that the 52 SNP loci were efficiently amplified, with a mean match
probability of 5 × 10⁻¹⁹, from 700 degraded samples of the studied populations, while
the CE method generated only partial STR profiles (Sanchez et al., 2006). SNPs therefore
proved to be more efficient markers than STRs for the analysis of degraded DNA samples
(Sanchez et al., 2006). Cho et al. (2020) tested 70-year-old postmortem samples of human
skeletal remains using an MPS-based STR system, which is recommended for degraded
samples, and reported a higher success rate than with the CE-based STR typing method.
Another study demonstrated the utility of NGS technology by analyzing degraded vs.
non-degraded and mixed samples of two close relatives in varied ratios (minor to major
contributors), i.e., 1:9, 1:19, 1:29, 1:39, 1:79, and 1:99 (Hwa et al., 2020). Genotyping
results of MPS-based 59-identity STRs and 94-identity SNPs showed that 82.4% and 54.5% of
minor contributors could be precisely recognized from both intact and degraded DNA
mixtures, respectively (Hwa et al., 2020). Additionally, microhaplotype assays have been
developed to resolve unequal contributors to a DNA mixture (Bennett et al., 2019; Chen
et al., 2019).
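
As a rough illustration of why the extreme mixture ratios above are demanding, the
following sketch estimates how many sequencing reads would carry an allele unique to the
minor contributor. It is a minimal sketch and is not taken from the cited studies: the
function names, the read depth of 1000, and the assumption that reads are sampled in
proportion to template copies (no stutter, dropout, or amplification bias) are all
illustrative.

```python
# Illustrative only: assumes reads are drawn in proportion to template copies,
# with no amplification bias, stutter, or allele dropout.


def minor_fraction(minor_parts: float, major_parts: float) -> float:
    """Fraction of the DNA template that comes from the minor contributor."""
    return minor_parts / (minor_parts + major_parts)


def expected_minor_reads(depth: int, minor_parts: float, major_parts: float,
                         allele_copies: int = 1) -> float:
    """Expected reads for an allele carried only by the minor contributor.

    allele_copies is 1 if the minor contributor is heterozygous for the allele
    and 2 if homozygous (each person carries two allele copies per locus).
    """
    return depth * minor_fraction(minor_parts, major_parts) * allele_copies / 2


# Mixture ratios (minor:major) quoted in the text; the depth is hypothetical.
RATIOS = [(1, 9), (1, 19), (1, 29), (1, 39), (1, 79), (1, 99)]

if __name__ == "__main__":
    for minor, major in RATIOS:
        reads = expected_minor_reads(1000, minor, major)
        print(f"{minor}:{major} -> about {reads:.1f} reads per 1000")
```

Under these simplifying assumptions, a heterozygous minor-contributor allele in a 1:99
mixture is expected in only about 5 of every 1000 reads at that locus, which is why deep
coverage and many markers help in resolving such mixtures.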

Of 273,566 multi-allelic SNPs (from the 1000 Genomes Phase III) located across the
human chromosomes (Phillips et al., 2020a, 2020b), approximately 99% are triallelic and
~1% are tetra-allelic. Compared to biallelic SNPs, these SNPs have high discrimination
power and are helpful in analyzing highly degraded and mixed DNA patterns (Phillips et
al., 2020b). A large-scale MPS-based tetra-allelic SNP panel has been designed by
selecting 160 SNPs with high heterozygosity and 24 SNPs with high discriminating power.
This panel enables the detection of more than two alleles in at least one marker in 99%
of mixed-sample cases involving African and European donors, falling to 92.6% in the
case of Asian donors (Phillips et al., 2015). A SNaPshot-based panel, the
multiple-allele SNP test for forensics, containing 27 triallelic and 2 tetra-allelic
SNPs, was designed for forensically degraded samples and DNA mixture deconvolution. It
showed high discrimination power in all five main population groups, with a random match
probability value ranging from 107‘? to 10-7” (Phillips et al., 2020b). Information from
this small panel could be used to develop large-scale MPS-based multi-allelic SNP panels,
which are thought to be more potent for degraded-sample analysis and DNA mixture
deconvolution.
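
The gain in discrimination power from multi-allelic SNPs can be illustrated with a short
calculation. The sketch below computes the per-locus random match probability as the sum
of squared genotype frequencies. It assumes Hardy-Weinberg equilibrium, independent loci,
and equally frequent alleles, none of which is claimed by the cited studies; the function
names are illustrative.

```python
import math
from itertools import combinations_with_replacement


def match_probability(allele_freqs):
    """Per-locus random match probability under Hardy-Weinberg equilibrium.

    This is the sum over all genotypes of the squared genotype frequency.
    """
    pm = 0.0
    indices = range(len(allele_freqs))
    for i, j in combinations_with_replacement(indices, 2):
        if i == j:
            genotype_freq = allele_freqs[i] ** 2
        else:
            genotype_freq = 2 * allele_freqs[i] * allele_freqs[j]
        pm += genotype_freq ** 2
    return pm


def loci_needed(pm_per_locus, target):
    """Number of independent, identical loci needed to reach a target value."""
    return math.ceil(math.log(target) / math.log(pm_per_locus))


if __name__ == "__main__":
    for k in (2, 3, 4):
        pm = match_probability([1.0 / k] * k)
        print(f"{k} equally frequent alleles: PM = {pm:.4f} per locus")
    print(loci_needed(match_probability([0.5, 0.5]), 1e-19), "ideal biallelic SNPs")
```

With equally frequent alleles, the per-locus match probability falls from 0.375 for a
biallelic SNP to about 0.185 for a triallelic and about 0.109 for a tetra-allelic SNP, so
fewer multi-allelic loci are needed to reach a given combined match probability.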

Despite the availability of refined techniques for isolating DNA from degraded skeletal
material, sequencing success is still limited by the low DNA quantity and quality of
compromised samples. A recent study recommended optimizing the sequencing workflow when
dealing with highly degraded samples (with concentrations >31.2 pg), for example by
increasing the volume of the library pool, performing additional PCR purification, and
reducing the adapter volume (Senst et al., 2022). Target enrichment with biotinylated
probes prior to massively parallel sequencing has also shown remarkable results with
degraded and ancient DNA (Bardan et al., 2023). The power of advanced genetic technology
is evidenced by the work of Svante Pääbo, awarded the 2022 Nobel Prize in Physiology or
Medicine, who decoded the mtDNA from a 40,000-year-old bone sample of a Neanderthal, an
extinct species of humans (Vernot et al., 2021).


Low-copy DNA detection with NGS approach 


Presently, NGS platforms are forensically validated with multiple panels having different
sensitivities for input DNA, varying from 62.5 pg to 1 ng for autosomal, X-chromosome,
and Y-chromosome SNPs and STRs (Carratto et al., 2022). The first commercially available
MPS STR panel for the MiSeq FGx platform is ForenSeq DNA Signature (Illumina, CA), which
includes 172 SNPs, 27 autosomal STRs, 7 X-STRs, and 24 Y-STRs. Jager et al. (2017)
validated the kit and reported 89% complete STR profiles for 26 autosomal loci (out of a
total of 27 loci) from all dilutions of DNA ranging from 1 ng to 62.5 pg. The D22S1045
marker was excluded from this study due to its poor sensitivity, whereas X- and Y-STRs
showed 100% profile generation with 1 ng of input DNA. No significant decrease in
sensitivity was found down to 62.5 pg of input DNA, whereas input DNA quantities below
8 pg showed more than 50% loss of SNP calls. The Identity Panel (which includes 124 SNPs)
for human identification is sensitive down to 100 pg of DNA, while the Ancestry Panel,
which encompasses 165 biogeographic ancestry SNPs, requires 125 pg of DNA. The
GlobalFiler NGS STR Panel includes 35 STRs (autosomal, X, and Y chromosomes); it is an
important panel for mixed samples in varied ratios and requires a minimum of 125 pg of
input DNA. A panel for combined forensic phenotyping and ancestry determination, the
VISAGE Basic Tool Research Panel (which includes 152 SNPs and 1 InDel, of which 41 SNPs
belong to the HIrisPlex-S system), generates a profile with 1 ng of template DNA. Its
validation study reported the generation of a full profile at a dilution of 100 pg of
DNA (Xavier et al., 2020). An all-in-one SNP panel, FORensic Capture Enrichment (FORCE),
comprising 5422 forensically relevant SNPs (identity, ancestry, phenotype, and lineage
SNPs), was used to generate a profile from 200-year-old bone samples; 44.4% of the SNPs
in the panel were successfully typed from such old bone (Tillmar et al., 2021).
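
To put the input quantities above into perspective, the following sketch converts DNA
mass into an approximate number of diploid genome copies. It assumes that one diploid
human genome weighs roughly 6.6 pg, a commonly used approximation that is not stated in
the studies cited here; the constant and function names are illustrative.

```python
# Assumption: one diploid human genome weighs roughly 6.6 pg. Published
# estimates vary slightly, so treat the results as order-of-magnitude figures.
PG_PER_DIPLOID_GENOME = 6.6


def diploid_genome_copies(input_pg: float) -> float:
    """Approximate number of diploid genome equivalents in a DNA input."""
    return input_pg / PG_PER_DIPLOID_GENOME


# Input amounts mentioned in the text, in picograms (1 ng = 1000 pg).
INPUTS_PG = [1000, 125, 100, 62.5, 8]

if __name__ == "__main__":
    for pg in INPUTS_PG:
        copies = diploid_genome_copies(pg)
        print(f"{pg:>6} pg -> about {copies:.1f} diploid genome copies")
```

On this approximation, 62.5 pg corresponds to fewer than 10 cells' worth of DNA and 8 pg
to little more than a single cell, which is consistent with the heavy loss of SNP calls
reported at the lowest inputs.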


Individual identification 


SNP markers with high power to differentiate among individuals are known as
identity-testing SNPs. They serve the same function as conventional Combined DNA Index
System (CODIS) STR markers but with different discrimination power, sensitivity, and
specificity. SNPs with the highest heterozygosity (50% heterozygosity) and low Fst values
are considered the ideal identity markers. Kidd et al. (2006) screened 90,000 candidate
SNPs in European Americans, African Americans, and Chinese/Japanese and selected 2723
markers with average heterozygosity >0.45 and Fst <0.01 for the three populations. Of
these 2723 SNPs, 35 were chosen after testing in a broad population set, and finally, 19
identity SNPs were selected with high heterozygosity and low Fst in all 40 populations.
With the application of advanced high-throughput techniques, SNP markers have been
identified as an efficient profiling tool for individual identification. Several
NGS-based kits are commercially available for the CODIS STR loci used in forensic
testing. Currently, the major NGS platforms on the market are Illumina’s MiSeq (San
Diego, CA) and the Ion S5 PGM System (Thermo Fisher), which are forensically validated
with various MPS kits for CODIS STRs as well as NGS SNP multiplex panels for individual
identification, such as the ForenSeq DNA Signature panel (Illumina), the MPS panel for
the SNPforID 52-plex (Valle-Silva et al., 2019), the 92-plex IISNPs, the 140-locus SNP
panel (QIAGEN) (Avent et al., 2019), the 21-plex STR identity panel (Silvery et al.,
2020), and the SifaMPS panel comprising 87 STRs and 294 SNPs (Tao et al., 2021). The
Precision ID Identity panel of 124 SNPs, a 448-plex SNP panel (Zhao et al., 2021), and a
multiplex panel of 1270 triallelic SNPs (Phillips et al., 2009) are also in use. Multiple
studies have demonstrated that autosomal and Y-STR analysis using MPS panel-based
genotyping-by-sequencing (Syndercombe-Court, 2021; de Knijff, 2022) is more robust and
sensitive than CE-based genotyping-by-length approaches. The NGS panel Precision ID
Identity (Thermo Scientific, US) has been developed for human individualization using 124
identity SNPs (90 autosomal and 34 Y-chromosome SNPs). A validation study assessed these
identity SNPs in the Central Indian population on the Ion GeneStudio S5 instrument and
reported that the allele frequencies of the autosomal SNPs ranged from 0.001 to 0.377.
One SNP, rs9951171, was reported to be the most potent marker in the Central Indian
population with the highest power of discrimination and the lowest cumulative
match probability value of 4.76698 x 10~°’ (Dash et al., 2022b). High-throughput NGS is
of major importance for discriminating closely related individuals: sequencing reveals
mutations in the flanking regions of STR loci, which uncovers new alleles and ultimately
increases the discriminating power (Rockenbauer et al., 2014; Devesse et al., 2018). In
the Indian scenario, Dash et al. (2021) conducted the first study reporting
sequence-based allele determination in the Central Indian population using the NGS STR
panel Precision ID GlobalFiler v2. An increase in allele number was observed in 31
autosomal markers due to mutations found in the flanking regions of STR loci, and the
study also revealed a prominent substitution, rs4847015, at the tetranucleotide STR
marker D1S1656 in 11.5% of the Central Indian population (Dash et al., 2022). Allele gain
in intervening regions as well as flanking regions at multiple STR loci increases the
statistical or discrimination power, leading to a higher evidentiary value of NGS-based
SNP analysis compared to CE-based STR analysis (Dash et al., 2021). Additionally,
multiallelic SNPs (triallelic or tetra-allelic) have more possible genotype
combinations, which increases the level of polymorphism and consequently the power of
discrimination compared to biallelic markers (Cao et al., 2015; Cornelius et al., 2019;
Gao et al., 2018; Phillips et al., 2015; Westen et al., 2009). SNPs rs4540055 and
rs5030240 are reported to be the most potent triallelic forensic markers for individual
identification (Cao et al., 2015; Westen et al., 2009). Insertion/deletion polymorphisms
(InDels), the second most abundant biallelic polymorphisms in the human genome (approx.
3.5 million InDels) (The 1000 Genomes Project Consortium et al., 2015), can be used in
combination with SNPs. Considering the lower mutation rates
(~ 2.9 x 107°) (Nachman and Crowell, 2000) and shorter amplicon sizes resulting in a 
low rate of ambiguous sequencing results such as stutter, InDels are also regarded as
useful forensic markers that offer advantages over STRs (Al-Snan et al., 2021;
Zaumsegel et al., 2013).
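
The two selection criteria quoted above (average heterozygosity >0.45 and Fst <0.01) can
be computed directly from population allele frequencies. The sketch below is a minimal
illustration for a single biallelic SNP, using expected heterozygosity under
Hardy-Weinberg equilibrium and Wright's Fst with equally weighted populations. The
function names and example frequencies are hypothetical, and the published screening may
have used a different estimator.

```python
def expected_heterozygosity(p: float) -> float:
    """Expected heterozygosity of a biallelic SNP with allele frequency p."""
    return 2 * p * (1 - p)


def fst(pop_freqs) -> float:
    """Wright's Fst = (HT - HS) / HT, weighting all populations equally."""
    n = len(pop_freqs)
    p_bar = sum(pop_freqs) / n
    h_total = expected_heterozygosity(p_bar)
    h_sub = sum(expected_heterozygosity(p) for p in pop_freqs) / n
    if h_total == 0:
        return 0.0  # monomorphic in every population
    return (h_total - h_sub) / h_total


def is_identity_candidate(pop_freqs, min_het=0.45, max_fst=0.01) -> bool:
    """Apply the heterozygosity and Fst thresholds quoted in the text."""
    n = len(pop_freqs)
    avg_het = sum(expected_heterozygosity(p) for p in pop_freqs) / n
    return avg_het > min_het and fst(pop_freqs) < max_fst


if __name__ == "__main__":
    # Hypothetical allele frequencies in three populations.
    print(is_identity_candidate([0.48, 0.52, 0.50]))  # similar everywhere
    print(is_identity_candidate([0.10, 0.50, 0.90]))  # strongly differentiated
```

A SNP whose allele frequency stays close to 0.5 in every population passes both
thresholds, whereas one with strongly diverging frequencies has a high Fst and is better
suited as an ancestry marker than as an identity marker.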


Human ancestry and migration history 


Although identity SNPs and STR loci are powerful markers for individualization, they
are inadequate for elucidating the biogeographical evolutionary ancestry and migration
history of an individual because alleles of these markers are widely shared among
populations (Zheng et al., 2018). Approximately 80%–90% of genetic variation is shared
among all individuals, while only 10%–20% of the total variation is population-specific,
which can be referred to as “private variation” in a population (Deka et al., 1995; Nei,
1987). Allele frequencies of ancestry markers vary significantly among different
continental and subcontinental populations (Kosoy et al., 2009). A commercial kit, DNA
Witness BGA, was developed to elucidate the ancestry of four continental populations,
viz., American, European, East Asian, and sub-Saharan African, by calculating the
proportion of different ancestry components in admixed individuals (Frudakis, 2008;
Shriver et al., 1997). Furthermore, two more array-based genotyping panels were
developed: the first panel consisted of 320 Eurasian-specific ancestry markers that
determine admixture among Southeastern European, Northern European, Middle Eastern, and
South Asian populations (Frudakis, 2008), and the other panel consisted of 1476 European
ancestry markers that have the power to elucidate European subancestry such as Jewish,
Italian, German, Irish, Finnish, Greek, Spanish, and Russian (Bauchet et al., 2007;
Frudakis, 2008). The VISible Attributes through GEnomics (VISAGE)
Basic Tool (BT) is the first forensic MPS single-assay test, comprising 120 markers, for
predicting both externally visible characteristics (EVCs) and continental ancestry. In
2020, the VISAGE Consortium expanded the use of MPS-based techniques to predicting the
biogeographic ancestry, facial phenotype, and age of an unknown sample donor
(Palencia-Madrid et al., 2020). Bardan et al. (2023) developed a customized hybridization
capture panel that correctly predicted European biogeographic ancestry in 75% of
individuals and hair and eye color in 83% and 92%, respectively (Ruiz et al., 2013). In
recent years, a number of BGA panels have succeeded in predicting the genetic ancestry of
populations at the continental level (Gettings et al., 2014; Lao et al., 2006; Pereira et
al., 2019; Phillips et al., 2013; Qu et al., 2019; Santos et al., 2016; Xavier et al.,
2020; Zhao et al., 2019). However, the accuracy of biogeographic ancestry determination
is compromised when studying strictly geographically confined populations, which may be
due to candidate SNP selection and the limited number of SNPs (Mehmood et al., 2012).
Recently, a study reported an enhanced capacity to infer ancestry at the continental and
subcontinental levels by using four forensic SNP panels together with a partial least
squares discriminant analysis (PLS-DA) algorithm. In line with this study, Pilli et al.
(2023) developed a new BGA panel comprising >3000 SNPs and evaluated its accuracy using
the PLS-DA method. By using different selection techniques and genetic algorithms, they
reduced the number of markers to the 800 most informative SNPs for inferring the
ancestry of populations at continental and subcontinental scales.
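
The idea that allele frequency differences between populations carry ancestry information
can be illustrated with a simple likelihood comparison. Note that this sketch is
deliberately not PLS-DA, the method used in the studies above; it is a much simpler
likelihood-based assignment that assumes Hardy-Weinberg equilibrium and independent SNPs.
The population labels, allele frequencies, and function names are hypothetical.

```python
import math

# Hypothetical alternate-allele frequencies at three SNPs in two populations.
# Frequencies must lie strictly between 0 and 1 to avoid log(0).
HYPOTHETICAL_FREQS = {
    "POP_A": [0.90, 0.80, 0.10],
    "POP_B": [0.20, 0.30, 0.70],
}


def genotype_log_likelihood(genotype, freqs):
    """Log-likelihood of a multi-SNP genotype given population frequencies.

    genotype: alternate-allele counts (0, 1, or 2), one per SNP.
    freqs: alternate-allele frequency per SNP in one population.
    """
    log_lik = 0.0
    for count, p in zip(genotype, freqs):
        q = 1 - p
        prob = {0: q * q, 1: 2 * p * q, 2: p * p}[count]
        log_lik += math.log(prob)
    return log_lik


def assign_ancestry(genotype, pop_freqs):
    """Return the population with the highest likelihood, plus all scores."""
    scores = {pop: genotype_log_likelihood(genotype, freqs)
              for pop, freqs in pop_freqs.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    best, scores = assign_ancestry([2, 2, 0], HYPOTHETICAL_FREQS)
    print(best, scores)
```

With only a handful of markers the likelihoods of different populations can be close,
which is one reason larger sets of informative SNPs are selected for inference at the
subcontinental scale.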

Because a phenotype is determined by both genotype and environment, when an ancestral
population migrates to an extreme environmental or geographical location, natural
selection drives genetic adaptation to the microenvironment and reshapes the population’s
ancestry (Harris & Pritchard, 2017). An Indian whole genome sequencing (WGS) study of a
highland native from Ladakh, an individual from the submontane Kumaun region, and an
individual from Telangana (Andhra Pradesh) revealed high altitude-relevant genetic
variants of EGLN1 [rs12097901 (Cys127Ser) and rs186996510 (Asp4Glu)] in Ladakh natives
(Malhotra et al., 2018). This corroborates many studies showing that EGLN1 is positively
selected for high-altitude adaptation (Lorenzo et al., 2014; Xiang et al., 2013). In a
forensic context, these adaptation loci could also be useful in establishing a person’s
biogeographic history, thus helping to narrow down the possible matches from a list of
missing individuals.


Facial phenotyping

Recent developments in genotyping and sequencing techniques have proven SNPs to be a
very informative tool for determining a person’s EVCs or phenotypic traits such as skin
color, hair color, eye color, face morphology, and hair morphology. EVCs are polygenic in
nature and are determined by the complex interplay of an individual’s genotype and
environmental factors (including altitude, latitude, diet, exercise, and stress exposure
such as UV exposure, heat, cold, oxygen level, etc.). However, with advanced
high-throughput techniques, additive genotype information from various phenotypic
markers helps a forensic scientist predict the physical appearance of unknown
perpetrators from a DNA sample found at the crime scene (Kayser, 2015; Seo et al., 2013),
which provides a lead to investigating agencies and narrows down the list of suspects.
This inferential technique is known as forensic DNA phenotyping (FDP). FDP is especially
useful in forensic cases where no reference sample is available for comparison with an
unknown DNA sample.

Human pigmentation phenotypes such as hair color, eye color, and skin color have been
extensively studied, whereas other polygenic traits such as face morphology, freckles,
age, height, and myopia have rarely been studied. The variation in human skin, eye, and
hair color is due to the amount and the two types of melanin pigment produced in the
pigmentation pathway. The melanin type responsible for a darker (brown/black) phenotype
is called eumelanin, and the pigment responsible for a lighter (reddish) color is called
pheomelanin. Some candidate genes, viz., OCA2, HERC2, SLC24A5, and SLC45A2, encoding
membrane ion transport proteins, play a significant role in the amount of melanin
produced by affecting the activity of tyrosinase, which is encoded by TYR. Three SNPs
located in the OCA2 gene, viz., rs6497268, rs7495174, and rs11855019, have been shown to
have a strong association with blue versus nonblue eye color (Duffy et al., 2007).
Moreover, another association study reported the strongest significant association of
green iris color with the OCA2 marker rs1800407 and two important eye color variants
(rs74653330 and rs121918166) (Branicki et al., 2008; Andersen et al., 2011). Apart from
OCA2, a genome-wide association study (GWAS) observed up to 15 SNP markers in the HERC2
gene. Among them, rs916977 (where the T ancestral allele is associated with brown iris
color and the C allele with blue iris color) and rs1667394 emerged as important markers
in determining eye-color variation in people of European descent (Kayser et al., 2008). A
number of phenotype-genotype association studies helped in the development of efficient
and reliable multiplex genotyping panels such as IrisPlex for the prediction of eye color
(blue/brown), consisting of six potent markers at different loci, viz., OCA2 (rs1800407),
SLC24A4 (rs12896399), SLC45A2 (rs16891982), HERC2 (rs12913832), IRF4 (rs12203592), and
TYR (rs1393350), based on European (Dutch) population genotype and phenotype data (Ruiz
et al., 2013; Yun et al., 2014).
These phenotype-related markers were further tested in other European countries
(including Norway, Estonia, the UK, France, Spain, Italy, and Greece), showing high
predictive power (area under the curve = 0.96) for two eye colors (blue and brown)
(Walsh et al., 2011). For intermediate eye colors, IrisPlex could not give as accurate a
prediction as for blue and brown eyes (Liu et al., 2009; Walsh et al., 2011, 2012). As
per the available studies, along with OCA2 rs1800407, HERC2 rs12913832 was found to be
the most potent marker, with the highest prediction values ranging from 68.8% to 74.8%
for eye color variation (Andersen et al., 2011; Valenzuela et al., 2010). However, the
selection of ideal SNP markers for the phenotypic prediction model appears to be highly
population-specific: rs1408799 was significantly associated with eye color in the Polish
population, rs1800416 in the Brazilian population, and rs916977 in the Czech population
(Andrade et al., 2017; Zidkova et al., 2013). Genetic variations associated with skin
pigmentation were observed among individuals belonging to biogeographically distinct
continental groups (Asian, African, Caucasian, and Australian) and also among
individuals from the same geographic region, reflecting a continuous variation in skin
color shade.

Genetic polymorphisms associated with hair pigmentation variation include SLC45A2
rs16891982, SLC45A2 rs26722, and ASIP rs369378152 (van Daal, 2008). Among them,
significant allele frequency variation was observed for SLC45A2 (rs16891982) between
dark-haired non-Caucasian populations (viz., Asians, Americans, and Africans) and
light-pigmented Caucasian populations (van Daal, 2008). A cSNP, rs3827760, in the
ectodysplasin A receptor gene was identified as the dominant marker for straight hair in
East Asians (Raddell et al., 2020). With time and the increased availability of
genotype-phenotype data, another model, HIrisPlex, was developed for combined prediction
of eye and hair color, consisting of the 22 most predictive SNPs (4 IrisPlex markers +
18 new markers). The HIrisPlex tool has shown prediction accuracies of 69.5% for blond
hair color, 78.5% for brown, 80% for red, and 87.5% for black independently in the
European population. However, when tested on a non-European population, it demonstrated
satisfactory eye/hair color prediction but with lower accuracy (Balanovsky et al., 2019).
Moreover, a study identified HERC2 rs1129038 as a potent marker for the prediction of
blond hair, HERC2 rs12913832 for black hair, and HERC2 rs12931267 and TPCN2 rs35264875 as
ideal markers for brown hair. Later, a third combined model, HIrisPlex-S, was created for
concurrent prediction of eye, hair, and skin color, encompassing 36 markers (19 markers
from HIrisPlex and 17 novel markers). Compared to other models, the HIrisPlex-S tool
proved to have high prediction accuracy and discriminating power among different skin
tones on a global scale, i.e., 72% accuracy for pale skin, 74% for very pale, 73% for
intermediate, 87% for dark, and 97% for dark black (Chaitanya et al., 2018). The compiled
phenotype and genotype data from the IrisPlex, HIrisPlex, and HIrisPlex-S systems are
publicly available in an interactive tool (https://hirisplex.erasmusmc.nl/). Through
this tool, prediction models can generate individual probabilities for the available
categories of five skin colors, three eye colors, and four hair colors from genotype data
for 41 markers. Conversely, the physically visible appearance of an individual may also
suggest their BGA.
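
How additive genotype information can be turned into category probabilities is sketched
below with a multinomial logistic (softmax) model. This is a minimal illustration under
stated assumptions: the intercepts and per-allele coefficients are invented for the
example and are not the published IrisPlex or HIrisPlex parameters, only two of the
markers named above are used, and the allele counted at each SNP is hypothetical.

```python
import math

# Hypothetical parameters, NOT the published model. Each genotype value is the
# number of copies (0, 1, or 2) of an assumed effect allele at that SNP.
HYPOTHETICAL_MODEL = {
    "blue": {"intercept": 0.0, "rs12913832": 0.0, "rs1800407": 0.0},
    "intermediate": {"intercept": -1.0, "rs12913832": 1.5, "rs1800407": 0.8},
    "brown": {"intercept": -2.0, "rs12913832": 3.0, "rs1800407": 0.3},
}


def predict_probabilities(genotype, model):
    """Softmax over additive per-category scores."""
    scores = {}
    for category, coefs in model.items():
        score = coefs["intercept"]
        for snp, allele_count in genotype.items():
            score += coefs.get(snp, 0.0) * allele_count
        scores[category] = score
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {cat: math.exp(s - top) for cat, s in scores.items()}
    total = sum(exps.values())
    return {cat: value / total for cat, value in exps.items()}


if __name__ == "__main__":
    for genotype in ({"rs12913832": 0, "rs1800407": 0},
                     {"rs12913832": 2, "rs1800407": 0}):
        probs = predict_probabilities(genotype, HYPOTHETICAL_MODEL)
        print(genotype, {cat: round(p, 3) for cat, p in probs.items()})
```

The output is one probability per category that sums to 1, mirroring the form of the
per-category probabilities described above; any real prediction would need the validated
model parameters and allele coding.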

In the context of facial phenotyping, a vast literature is available on the application of
GWAS rather than NGS techniques on a genomic scale. Over the last decade, multiple GWAS
have revealed various genetic markers associated with facial shape phenotypes (Xiong et
al., 2019) and with syndromes involving abnormal facial features, such as Down syndrome,
Prader-Willi syndrome, and cleft lip (Boehringer et al., 2011; Brinkley et al., 2016).
The application of high-throughput NGS-based genotyping techniques has considerably
accelerated causal gene discovery for polygenic complex traits with effective prediction
accuracy. Another robust technique, whole exome sequencing (WES), sequences only the
protein-coding region (known as the exome), which forms 2% of the whole genome (Ng et
al., 2009). Approximately 80% of the mutations that cause Mendelian diseases lie in the
exome. In medical genetics, WES has become a robust technique for the prognosis of rare
genetic disorders with a prevalence below 1/2000 (Ng et al., 2010). Therefore, WES or WGS
can be an important technique for diagnosing craniofacial Mendelian traits or genetic
syndromes related to facial feature abnormalities by uncovering rare pathogenic variants
(Al-Nbaheen, 2018; Yang et al., 2013). Another recent phenotyping application, Face2Gene
(http://www.face2gene.com), is available for clinical and research purposes and is
mainly used to evaluate genetic syndromes related to facial dysmorphology from patient
information such as photo, gender, weight, and height. Few genotype-phenotype
association studies are available for the Indian population. Extensive research is
needed to discover robust candidate markers, develop new MPS panels/assays, and validate
these phenotypic markers in different populations.


Paternal lineage and kinship analysis 


Paternal lineage and kinship-associated SNPs have been extensively studied on the X
chromosome and the uniparental Y chromosome, as well as in mtDNA. Lack of recombination
and a lower mutation rate (10⁻⁸) than STR markers (10⁻³) are the most important features
of these lineage markers (Amorim and Pereira, 2005). In forensics, lineage SNPs have
proven advantageous in kinship analysis for mass disaster identification or missing
person identification, mostly when close first-degree relatives are unavailable as
reference samples.

Numerous studies are available on the validation and assessment of various Y-STR typing
systems, such as a 24-plex (Fan et al., 2021), a 41-plex (Liu et al., 2020), a multiplex
with 12 multicopy Y-STR loci (Shang et al., 2021), and a 26-plex for rapidly mutating
Y-STRs (Albastaki et al., 2022). Such studies have proven the importance of lineage
markers in sexual assault and paternal lineage differentiation cases. Nevertheless, these
CE-based commercial Y-STR multiplex kits are limited in identifying closely related
males belonging to the same paternal lineage and in resolving mixed samples in incest
cases, because no single amplification kit contains a sufficient number of loci. When
CE-based STR typing gives inconsistent results, genetic markers such as InDels
(Gonzalez-Herrera et al., 2020), haplogroup analysis, and SNP markers (Borsting &
Morling, 2011) have been identified as alternative markers in complex paternity cases. In
this regard, Penta E was found to be the most potent STR marker for paternity
determination in the Indian population, with a heterozygosity of 0.906 and the lowest
match probability (0.022) (Dash et al., 2021).

Multiple studies are available on NGS-based multiplex panels used for paternity and
kinship analysis. In many father-child duo cases, the genotypes of the 49 SNPs of the
52-plex test helped in assessing paternity, while the STR profile was inconclusive in
proving the paternity of the alleged father. Moreover, in an incest case, SNP analysis
identified the grandfather as the true biological father and cleared the putative father
(Borsting & Morling, 2011). Hence, these cases showed the value of SNP markers, in
addition to STR studies, in assessing complex paternity and kinship cases. A World War
II-era family relationship case was solved using the all-in-one SNP panel FORCE, which
encompasses 5422 forensically relevant markers (X and Y SNPs, ancestry SNPs, and
phenotype markers), by predicting distant relationships (up to the fifth degree) with
strong statistical support (Tillmar et al., 2021).
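
The way many SNPs accumulate evidence in a father-child duo can be sketched with a
paternity index (PI) calculation. The sketch below is a minimal illustration and not the
procedure of the cited studies: it assumes Hardy-Weinberg equilibrium, independent loci,
no mutation or genotyping error, and an untyped mother, and the allele frequencies,
genotypes, and function names are hypothetical.

```python
import math


def transmission_prob(parent_genotype, allele):
    """Probability that a parent with this genotype transmits the allele."""
    return parent_genotype.count(allele) / 2


def duo_paternity_index(child, alleged_father, freqs):
    """Paternity index at one locus for a father-child duo (mother untyped).

    Numerator: P(child genotype | alleged father is the father), with the
    maternal allele drawn from the population. Denominator: P(child genotype |
    a random man is the father), i.e. the population genotype frequency.
    """
    a, b = child
    if a == b:
        numerator = transmission_prob(alleged_father, a) * freqs[a]
        denominator = freqs[a] ** 2
    else:
        numerator = (transmission_prob(alleged_father, a) * freqs[b]
                     + transmission_prob(alleged_father, b) * freqs[a])
        denominator = 2 * freqs[a] * freqs[b]
    return numerator / denominator


def combined_paternity_index(loci):
    """Product of per-locus indices; loci is a list of (child, father, freqs)."""
    return math.prod(duo_paternity_index(c, f, fr) for c, f, fr in loci)


if __name__ == "__main__":
    freqs = {"A": 0.5, "G": 0.5}  # hypothetical allele frequencies
    loci = [(("A", "A"), ("A", "A"), freqs),
            (("A", "G"), ("A", "G"), freqs),
            (("G", "G"), ("A", "G"), freqs)]
    print(combined_paternity_index(loci))
```

A single biallelic SNP contributes only a small index, so the combined value becomes
informative only when many independent loci are multiplied, and an opposite-homozygote
locus drives the product to zero under the no-mutation assumption.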

Y-SNP haplogroup studies using high-resolution NGS approaches enhance our understanding
of male individuals in forensic genetics and evolutionary studies. The NGS study of Dash
et al. (2022) revealed 11 unique haplogroups, with R1a1 reported as the most dominant
haplogroup (22.22%) in the Central Indian population, by analysis of 34 Y-chromosome SNPs
using the Precision ID Identity panel. An NGS panel consisting of both Y-SNPs (n =
15,611) and Y-STRs (n = 202) (including slow, moderate, and rapidly mutating Y-STRs) was
developed to individualize close paternal relatives worldwide by analyzing 1443
Y-haplogroup lineages (Claerhout et al., 2021). The SifaMPS panel, developed using the
predominant 381 Y-chromosomal markers, provided better haplogroup analysis in the Chinese
population and could be adapted for paternal lineage analysis, forensic genealogy
studies, and familial searching (Tao et al., 2021). Developing new assays with promising
markers that are applicable to diverse populations requires continuous research and
validation studies in different populations with large datasets.


Limitations of SNP analysis in forensics 


Applications of NGS-based SNP analysis still pose several drawbacks and technical issues
in forensic sample analysis. The biallelic nature of most SNPs reduces their
discrimination power per locus in comparison to highly polymorphic STRs. However, this
limitation is easily compensated for by using NGS to type a much larger set of SNPs,
achieving the same discrimination power as the core STR loci (Chakraborty et al., 1999).
Data interpretation becomes more difficult as more loci and amplification products are
obtained. Locus dropout is another challenge that grows with the number of loci in
comparison to conventional SNP genotyping. As a result, MPS-based SNP panels with a
larger number of markers may be prone to problems due to low integrity and low copy
numbers in the input DNA template. An increase in the number of loci means more signals
and a greater likelihood of artifacts, demanding trained personnel or experts for NGS DNA
profile analysis in forensic laboratories. Other issues with NGS-based SNP panels include
the low validation rate of some multiplex genotyping array platforms and the lack of
population-specific allele information, which is important for the global applicability
and high sensitivity of a panel. The major technical issue in SNP genotyping on many NGS
platforms is the sequencing of homopolymeric regions (Loman et al., 2015). Some SNP loci
located inside homopolymer stretches, rs1493232 and rs1031825, showed allelic imbalance
specifically on the nanopore sequencer, and similar issues have been reported with the
Ion Torrent platform (Loman et al., 2015). However, the high sensitivity of NGS
technology could establish a match accuracy of 99.9% when comparing an unknown sample
against a database of reference sequences. Nevertheless, this technique is currently
impractical for forensic case matching because DNA sequence databases of criminals are
unavailable: the genetic profiling data in forensic DNA databases are stored in the form
of CE-based STR types. Hence, constant efforts are needed in the forensic fraternity to
make SNPs a viable replacement for STRs in the foreseeable future.
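
Candidate SNPs that sit in or next to homopolymer stretches can be flagged during panel
design with a simple scan of the reference sequence. The following is a minimal sketch
under stated assumptions: the reference sequence, the SNP positions, and the run-length
threshold of 4 are all hypothetical, and real pipelines would work on genome coordinates
rather than short strings.

```python
def homopolymer_run_length(sequence: str, position: int) -> int:
    """Length of the run of identical bases that contains `position` (0-based)."""
    base = sequence[position]
    left = position
    while left > 0 and sequence[left - 1] == base:
        left -= 1
    right = position
    while right < len(sequence) - 1 and sequence[right + 1] == base:
        right += 1
    return right - left + 1


def longest_flanking_run(sequence: str, snp_pos: int) -> int:
    """Longest homopolymer run touching the bases directly flanking the SNP."""
    lengths = []
    if snp_pos > 0:
        lengths.append(homopolymer_run_length(sequence, snp_pos - 1))
    if snp_pos < len(sequence) - 1:
        lengths.append(homopolymer_run_length(sequence, snp_pos + 1))
    return max(lengths, default=0)


def flag_homopolymer_snps(sequence: str, snp_positions, min_run: int = 4):
    """Return the SNP positions whose flanking run reaches `min_run` bases."""
    return [pos for pos in snp_positions
            if longest_flanking_run(sequence, pos) >= min_run]


if __name__ == "__main__":
    reference = "ACGTTTTTGCA"  # hypothetical reference fragment
    print(flag_homopolymer_snps(reference, [1, 8]))
```

In this toy fragment the SNP at position 8 directly follows a run of five T bases and is
flagged, while the SNP at position 1 is not; flagged loci could then be excluded or
inspected more carefully on platforms known to struggle with homopolymers.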


Conclusions 


The discovery of SNPs apparently made a breakthrough with the development of NGS 
technologies. Application of the NGS-based SNP genotyping technique is advantageous 
in analyzing compromised forensic samples, such as highly degraded and mixed samples, 
complex kinship and paternity analysis in sexual assault cases, BGA, forensic phenotyping, 
and identification for missing individuals and unidentified human remains by providing 
investigative clues without a suspect. Despite the capability of providing sufficient data on 
loci across the genome, NGS technology needs to work on low template library prepa- 
ration, error rate, and issues with NGS data processing and mining so that it becomes a 
routine application in forensic science. Beyond human identification via SNP genotyping,
NGS is becoming dominant in other research disciplines, including body fluid identification,
identification of monozygotic twins, estimation of the age of the perpetrator through DNA
methylation techniques, nonhuman species identification, and microbial forensics.


Introduction to hair analysis in forensic science 


Humans normally shed 50–100 hairs per day (Aparecida de Franca et al.,
2015). Hair samples are a common type of trace evidence encountered at crime scenes.
Hair samples that are collected at crime scenes represent about 90% of the evidence most
commonly found (Amory et al., 2007). Hair evidence plays an important part in connecting
the suspect to the victim or vice versa. It can give investigators leads on who was,
and who was not, present at the crime scene.

DNA analysis is a popular method of identification and can be performed on hair ev- 
idence if the item can first be identified as human hair. Hair evidence samples are examined 
microscopically by forensic examiners to differentiate between human and animal hair. 
Human and animal hairs are composed of three basic regions: the cuticle, medulla, and cor- 
tex (Deedrick & Koch, 2004). These regions vary and have different morphological char- 
acteristics among various animal species and humans. Other morphological characteristics 
such as color, pigmentation, and the overall diameter of the hair shaft can help in differentiating
animal from human hair (Deedrick & Koch, 2004). Microscopic examination of
hairs can sometimes be viewed as subjective, with very minimal statistical analysis (Manheim,
2015). A study conducted in 2015 used attenuated total reflectance Fourier transform
infrared (ATR-FTIR) spectroscopy to develop a more objective preliminary
approach to distinguishing human hair from animal hairs and fibers (Manheim, 2015).
The study performed hair analysis on human, dog, and cat hairs by collecting 10 spectra
for each individual and animal. The technique was demonstrated to be effective
in distinguishing human hair from dog and cat hair (Manheim, 2015). ATR-FTIR can be a
good method for aiding in the identification of human hair due to its quick and nondestructive
nature toward forensic evidence. This nondestructive technique is important
when human hair evidence is sent for DNA analysis for further identification of an indi- 
vidual. Having a well-preserved hair sample improves the chances of obtaining a good 
DNA profile for an individual. 


Mitochondrial DNA analysis 


Mitochondrial DNA (mtDNA) extraction has been a conventional technique in DNA 
extraction from hairs when a root is not present for nuclear DNA (nuDNA) extraction. 
Due to the high copy number of the mitochondrial organelle, mtDNA is the most abundant
DNA along a hair shaft, and the shaft is the most frequently obtained region from
which DNA can be extracted for analysis. Hair contains cortical cells, which are a source 
of mtDNA along the hair shaft (Roberts & Calloway, 2007). These cells originate from 
the germinal bulb matrix cells and from melanocytes (Roberts & Calloway, 2007). Mito- 
chondria are found in the keratinized hair shaft (Linch et al., 2001). Multiple copies of 
mtDNA from the mitochondria are present along the hair shaft, making it an abundant 
source of mtDNA (Linch et al., 2001). 

mtDNA analysis has been efficient and effective, but compared to nuDNA, which has 
a higher discriminatory power, mtDNA variation is more limited and is largely limited to 
tracing the maternal lineage of an individual. An individual inherits a single mtDNA 
haplotype from their mother (Linch et al., 2001). Some individuals are found to have het- 
eroplasmy at a single locus or throughout the profile. Heteroplasmy can be defined as a 
heterogeneous pool of mtDNA in the cell (Linch et al., 2001). Typically, the mtDNA
control region is profiled, and two or more mtDNA populations are observed in a single
individual in cases of heteroplasmy (Huhne et al., 1999). 

Though mtDNA lacks the discrimination power of nuDNA, it can be used to obtain 
DNA profiles to include/exclude individuals. The Scientific Working Group on DNA 
Analysis Methods (SWGDAM) has stated that the discriminatory power of mtDNA is 
dependent on the reported sequence range of the hypervariable (HV) 1 and HV2 regions
(Damaso et al., 2021). The longer the sequence, the more likely a variation
will be detected. Though these two regions have a lower power of discrimination 
than short tandem repeats (STRs), they still allow for inclusion or exclusion based on 
haplotypes (Damaso et al., 2021). 

Hair can be recovered in one of three phases: anagen (growth), catagen (preparing to 
be shed), and telogen (shedding). mtDNA has been successfully recovered from telogen 
hairs in 92.5% of cases (Brandhagen et al., 2018). The quantity and quality of mtDNA 
decrease substantially along the length of the hair shaft (Brandhagen et al., 2018). Recovery
of mtDNA is most challenging from old hairs rather than
recently shed hairs. The small quantities of mtDNA recovered from old hairs present difficulties
in obtaining a probative mtDNA profile in forensic cases involving older,
damaged, or degraded hairs compared to recently shed hairs (Brandhagen et al., 2018). 
Additional factors that contribute to the successful amplification and typing of mtDNA 
are the donor’s race and age due to pigmentation differences (Roberts & Calloway, 
2007). Studies have demonstrated that shotgun sequencing can be an efficient alternative
method for obtaining complete mtDNA genomes from smaller
fragments of very old hairs without the presence of a root (Brandhagen et al., 2018).

Studies have also been conducted on the effects of chemically treating hair and DNA 
analysis. In 2017, one research study explored the effects of hair-color dyes or treatments 
and the impacts the dyes can have on DNA analysis (Al-Sammarraie et al., 2017). In the 
study, results indicated that hair dye and henna-treated hair impacted DNA analysis re- 
sults (Al-Sammarraie et al., 2017). The study indicated that henna led to a higher recov- 
ered DNA concentration as compared to chemically treated hair (Al-Sammarraie et al., 
2017). One explanation may be that the chemicals in the hair dyes, such as hydrogen 
peroxide, attack the bonds within DNA bases and fragment the DNA (Aparecida de 
Franca et al., 2015). Chemical treatment can also expose hair to harsh conditions such 
as UV light, heat, and other chemicals that are known to degrade DNA (Damaso et 
al., 2021). When scientists attempted second and third rounds of reamplification and repurification
of the color-treated hairs, they were finally able to obtain satisfactory mtDNA results
(Al-Sammarraie et al., 2017). Melton et al. (2005) stated that a history of extreme
environmental exposure in hair results in no mtDNA profile. Extracting mtDNA from
degraded, old, and chemically treated hairs is more challenging than successfully extract- 
ing mtDNA from recently shed hairs. In challenging circumstances, using a method to 
target smaller mtDNA sequences and using mini primer sets along with the use of 
next-generation sequencing (NGS) will lead to a better collection of mtDNA sequence 
information (Damaso et al., 2021). 


Nuclear DNA analysis 


If a hair root is present, nuDNA extraction is performed. Unfortunately, this is a challenge,
as 95% of the hairs found at crime scenes are telogen hairs without roots (Lepez
et al., 2014). A 2014 study aimed to develop a fast screening method for identifying
hair roots suitable for successful nuDNA analysis (Lepez et al., 2014). 4’,6-Diamidino-2-phenylindole
(DAPI) was used to stain nuDNA in hair roots collected from a crime scene
(Lepez et al., 2014). Staining can be used to help predict the success of STR analysis 
(Lepez et al., 2014). The method was demonstrated to be nondestructive, quick, and 
inexpensive allowing forensic laboratories to analyze only the most promising hair roots 
containing nuclei (Lepez et al., 2014). 

As the majority of crime scene hairs are rootless, recent research has focused on 
obtaining nuDNA profiles from hair shafts and has led to the success of nuDNA extrac- 
tion from rootless hairs (Brandhagen et al., 2018; Grisedale et al., 2018; Jehaes et al., 
1998). As previously described, hairs are made of dead skin cells that undergo a keratini- 
zation process (Aparecida de Franca et al., 2015). Due to the keratinization process, in 
which cellular organelles and nucleic acids are degraded, it is difficult to obtain nuDNA
from hair shafts (Brandhagen et al., 2018). The process can lead to DNA degradation, 
particularly nuDNA destruction, within the hair shaft (Brandhagen et al., 2018). 
Although nuDNA has also been obtained from telogen hairs, it is present in low quan- 
tities and varies substantially in quality (Brandhagen et al., 2018). Thus, it is not common 
practice for forensic crime labs to perform nuDNA extraction on telogen hairs, simply
because the qPCR assays in use are not sensitive enough to detect the low
quantities of nuDNA (Brandhagen et al., 2018). A study did demonstrate that direct
amplification of telogen hairs combined with an elevated number of PCR cycles
leads to full nuDNA STR profiles approximately 20% of the time (Grisedale
et al., 2018). The nuDNA typing assay, InnoTyper 21, has been tested and has been 
shown to improve the success rate for DNA typing of degraded rootless hair shafts (Grisedale
et al., 2018). InnoTyper 21 is a small amplicon (60–125 bp) test kit that is reported
to be very sensitive and provides useful nuDNA profiles from as little as 50 pg of degraded 
DNA (Grisedale et al., 2018). To determine the success of the InnoTyper 21 kit, a sub- 
group of hair samples was tested with the kit, and another was amplified using the 
GlobalFiler PCR amplification kit following the manufacturer-recommended protocol 
(Grisedale et al., 2018). The results showed that InnoTyper 21 was more successful in 
obtaining greater allele recoveries than GlobalFiler (Grisedale et al., 2018). Overall,
42% of all hair samples tested produced genotype recoveries greater than 40% with Inno- 
Typer 21 (Grisedale et al., 2018). Stated differently, 40% of the hair shaft samples led to 
interpretable DNA profiles using the InnoTyper 21 kit (Grisedale et al., 2018). It was 
noted that some donors yielded sufficient DNA profiles for all three of their samples, 
but no DNA was recovered from other donors (Grisedale et al., 2018). All the hairs 
that were analyzed were the same length and collected in the same manner. This indicates
that nuDNA recovery and typing success vary among individuals
and depend on the donor of the hair (Grisedale et al., 2018). The success of obtaining
results from hair shafts largely depends on the optimization of the extraction technique
being used and the integrity of the sample (Grisedale et al., 2018). Obtaining an
interpretable nuDNA profile from a single hair shaft is largely dependent upon the 
hair donor, the DNA quantity recovered, and the degradation index (DI) (Grisedale 
et al., 2018). The DI is determined by taking the ratio of the short quantity value to
the long quantity value (DI = short/long) (Grisedale et al., 2018). Highly discriminatory
nuDNA profiles were shown to be recoverable from rootless hair shafts in the study (Grisedale
et al., 2018).
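
As a minimal illustration of the DI calculation described above, the following Python sketch computes the ratio from two quantification values. The function name, the handling of an undetected long target, and the example quantities are assumptions made for illustration only; they are not taken from Grisedale et al. (2018).

```python
import math


def degradation_index(short_qty: float, long_qty: float) -> float:
    """Return DI = short / long for two quantity values in the same units."""
    if short_qty < 0 or long_qty < 0:
        raise ValueError("quantity values must be non-negative")
    if long_qty == 0:
        # Assumption: if the long target is undetected, report an unbounded DI
        # (or NaN when neither target is detected) instead of dividing by zero.
        return math.inf if short_qty > 0 else math.nan
    return short_qty / long_qty


# Hypothetical quantities (ng/uL), chosen only to show the arithmetic.
intact = degradation_index(short_qty=0.50, long_qty=0.50)
degraded = degradation_index(short_qty=0.50, long_qty=0.125)
print(f"intact DI = {intact:.1f}, degraded DI = {degraded:.1f}")
```

A DI near 1 suggests that short and long targets are present in similar amounts, whereas larger values suggest that the long target is depleted, as expected for degraded DNA.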

NuDNA has also been “assumed” to be contained in the nuclei of epithelial cells on 
the outer surface of the hair shaft rather than inside the hair shaft in the cortical cells 
(Brandhagen et al., 2018). Two studies have challenged this hypothesis and
demonstrated success in extracting nuDNA from old and ancient hair samples.
NuDNA was successfully extracted from degraded and old hair samples from Siberian 
mummies as well as historical hair samples from the Romanov family (Amory et al., 
2007; Loreille et al., 2022). Both of these studies demonstrated the importance of having 
effective, efficient, and clean sampling and DNA extraction methods to successfully 
recover nuDNA from hair shafts. Furthermore, Grisedale et al. (2018) showed high 
discrimination of nuDNA results for over 40% of the hair shaft samples tested and showed 
that out of 60 hair samples, 25 generated successful, interpretable DNA profiles. 


Application of NGS for hair analysis 


Next-generation sequencing (NGS) is an emerging technology in forensic science.
NGS is showing promise in the forensic science community for
improving DNA analysis of hair for both mtDNA and nuDNA. The application of NGS
for mtDNA, STR, and single nucleotide polymorphism (SNP) typing has demonstrated 
great promise for the analysis of forensic evidence, especially in cases in which it is 
degraded and limited (Shih et al., 2018). 

Successful hair nuDNA extractions and case studies have opened the field to future 
studies and inspired scientists to work to improve the methods and techniques to obtain 
the best possible yields of nuDNA from hair shafts. Quantities of nuDNA recovered from
recent hairs were so low that standard qPCR quantification methods did not yield quantification
values, but nuDNA quantity was assessed using the ratio of nuclear to mtDNA sequence
reads in the data (Brandhagen et al., 2018). The focus of a 2018 study was to examine the
distribution of nuDNA along a hair shaft (Brandhagen et al., 2018). The total percentage
of sequence reads mapped across the human nuclear genome from the recently collected
hairs ranged between 88.4% and 99.5% of the total DNA reads (Brandhagen et al., 2018).
The authors reported that 99% of the shotgun reads that were mapped to the human
genome were nuDNA sequences (Brandhagen et al., 2018). Surprisingly, these high percentages
of nuDNA were observed regardless of the purification protocol (Brandhagen
et al., 2018). The average size of nuDNA reads varied between 49 and 88 bp for five 
to six extractions (Brandhagen et al., 2018). Even in samples in which the nuDNA 
was determined to be more degraded than the mtDNA, it was found in far higher quan- 
tities in any given hair or hair sample (Brandhagen et al., 2018). Since the size of the nu- 
clear genome is much larger than the mtDNA chromosome, even with the large 
percentages of nuDNA reads, obtaining any reasonable depth of coverage across the 
genome failed (Brandhagen et al., 2018). An Illumina MiSeq FGx NGS instrument
was used, which is a comparatively low-throughput instrument; a higher-throughput
instrument capable of deeper sequencing, such as an Illumina NextSeq or HiSeq, would lead
to more mtDNA and nuDNA data (Brandhagen et al., 2018).
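
The coverage argument above can be made concrete with the standard approximation that mean depth equals the number of reads multiplied by the mean read length and divided by the target size. The sketch below is illustrative only: the read counts are hypothetical, the 70 bp read length is an assumed value within the 49 to 88 bp range reported above, and the nuclear genome size is an approximate figure.

```python
MT_GENOME_BP = 16_569        # length of the revised Cambridge Reference Sequence
NUCLEAR_GENOME_BP = 3.1e9    # approximate haploid human nuclear genome size


def mean_depth(n_reads: int, mean_read_len_bp: float, target_size_bp: float) -> float:
    """Expected mean depth of coverage: reads x read length / target size."""
    if target_size_bp <= 0:
        raise ValueError("target size must be positive")
    return n_reads * mean_read_len_bp / target_size_bp


# Hypothetical run: about 99% of one million mapped reads are nuclear.
nuclear_reads = 990_000
mt_reads = 10_000
read_len_bp = 70  # assumed mean read length

nuclear_depth = mean_depth(nuclear_reads, read_len_bp, NUCLEAR_GENOME_BP)
mt_depth = mean_depth(mt_reads, read_len_bp, MT_GENOME_BP)
print(f"nuclear genome: {nuclear_depth:.3f}x, mt genome: {mt_depth:.1f}x")
```

Under these assumed numbers, the nuclear genome receives far less than 1x mean depth even though nuclear reads dominate, while the much smaller mt genome is covered many times over.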

Obtaining full STR profiles from telogen hairs is a common challenge because of the poor
integrity of the DNA. In 2019, an NGS-based method termed the maSTR assay was developed to
genotype single telogen hairs (Heß et al., 2019). Capillary electrophoresis (CE)-based
STR methods were compared with NGS-based STR methods to determine
the best method for achieving successful STR profiles (Heß et al., 2019). The study
concluded that the maSTR assay yielded higher allele recovery than CE-based STR analysis
using DNA of single telogen hairs; full or partial profiles were obtained with few
allelic dropouts and drop-ins (Heß et al., 2019). Extremely degraded DNA samples
could not be typed using STRs by either CE or NGS (Heß et al., 2019). CE-based
profiles resulted in more allelic dropouts and fewer drop-ins than NGS-based profiles
(Heß et al., 2019). A CE-based profile resulted in a false exclusion for only one hair
sample, in which very little DNA and strong degradation made genotyping impossible
(Heß et al., 2019). Overall, this study was the first to calculate
likelihood ratios of NGS-based STR profiles of hairs with probabilistic methods in a
fully continuous manner using the GenoProof Mixture 3 software (Heß et al., 2019).
The correct assignment of DNA profiles from single telogen hairs to their donor profiles
is dependent on the quality and integrity of the recovered DNA (Heß et al., 2019). When
sufficient quality and quantity of DNA are recovered, the hairs can be assigned to their 
donor with statistical significance using both CE and NGS. 

NGS methods have led to advancements in mtDNA profiling of degraded samples, 
mixed samples, and hair samples. In contrast to Sanger sequencing, NGS can
rapidly sequence the whole mitochondrial chromosome of several samples in parallel
and produce millions of sequence reads in a single sequencing run (Shih et al., 2018).
NGS technologies can generate 25 million to 3 billion reads per sequencing run with
an output of approximately 15–600 Gb file size (Shih et al., 2018). This large quantity
of data aids in providing sufficient sequencing information for robust SNP and STR 
variant calling (Shih et al., 2018). Being able to analyze the whole mitochondrial genome 
rather than solely the hypervariable polymorphic regions HVI and HVII can increase the 
discrimination power and alleviate the problem of common sequences found in the HVI 
and HVII regions shared by individuals of the same area (Shih et al., 2018). HVI and 
HVII can often be successfully used in analyzing limited or degraded DNA samples, 
but NGS techniques are more sensitive than previous methods. 

In 2015, a sensitivity study was conducted to determine the minimum starting quantity
of DNA for the Roche 454 HVI/HVII NGS assay (Kim et al., 2015). The assay was
demonstrated to be highly sensitive for sequencing limited DNA quantities (~100
mtDNA copies) (Kim et al., 2015). The assay sensitivity was determined to be 0.5 pg
or ~500 copies (Kim et al., 2015). The DNA quantity could potentially be decreased
to 0.1 or 0.005 pg by eliminating the dilution after the amplification of low-copy-number
samples (Heß et al., 2019). Results were obtained from lower amounts of DNA than
the amount recommended by the manufacturer (Kim et al., 2015). The assay used in the
study was demonstrated to be specific to the targeted HVI/HVII regions and did not pro- 
duce fragment reads outside these regions (Kim et al., 2015). Sequencing the entire 
mtDNA genome instead of only the HVI/HVII regions adds valuable
discriminating power (Kim et al., 2015).

Furthermore, NGS can be used to simultaneously analyze mtDNA polymorphisms, 
STRs, SNPs, and indels (Shih et al., 2018). The sensitivity of NGS can be used to 
generate profiles when the sample is below the detection limit for CE (Shih et al., 
2018). The analysis of mtDNA polymorphisms with the addition of SNPs has proven 
to be valuable when nuDNA is too degraded or not recovered from telogen hair roots 
and hair shafts (Shih et al., 2018). Custom probe capture panels for targeting the whole 
mt genome and 426 nuclear SNPs on an Illumina MiSeq platform from a single shotgun 
library were applied to degraded DNA samples, telogen hair roots, and telogen hair shafts 
(Shih et al., 2018). Using the DNA probes greatly improved, and allowed for, the capture
of both mtDNA and nuclear SNP markers from a single shotgun DNA library without 
the consumption of additional DNA extracts for limited biological forensic samples (Shih 
et al., 2018). The hybrid probe capture NGS system led to mtDNA sequencing results for 
telogen hairs of 100% coverage of the mt genome at >100x read depth with >92%
aligned to the revised Cambridge Reference Sequence (Shih et al., 2018). Even when 
SNP loci dropout occurred in DNA tested from telogen hairs, the partial SNP profile 
combined with the whole mt genome sequence increased the discriminatory power, 
which likely approached or exceeded the discriminatory power of 13 STR loci profiles 
based on the number of SNPs typed (Shih et al., 2018). 
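
To illustrate why discriminatory power depends on the number of SNPs typed, the sketch below computes a random match probability for a panel of biallelic SNPs. It assumes Hardy-Weinberg proportions and independence between loci, which is a simplification, and the allele frequencies and panel sizes are hypothetical rather than values from Shih et al. (2018).

```python
from math import prod


def locus_match_probability(p: float) -> float:
    """Probability that two random individuals share a genotype at one biallelic SNP.

    Under Hardy-Weinberg proportions the genotype frequencies are p^2, 2pq and q^2,
    so the match probability is the sum of their squares.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    q = 1.0 - p
    return p**4 + (2 * p * q) ** 2 + q**4


def panel_match_probability(allele_freqs) -> float:
    """Product of per-locus match probabilities, assuming independent loci."""
    return prod(locus_match_probability(p) for p in allele_freqs)


# Hypothetical panels in which every SNP has an allele frequency of 0.5.
full_panel = panel_match_probability([0.5] * 50)
partial_panel = panel_match_probability([0.5] * 20)  # e.g., after locus dropout
print(f"50 SNPs: {full_panel:.2e}, 20 SNPs: {partial_panel:.2e}")
```

Each additional typed SNP multiplies the match probability by a factor below 1, so a partial profile is less discriminating than a full one but can still be informative, particularly when combined with mt genome data.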

SNPs can be typed using NGS to predict externally visible characteristics of an indi- 
vidual from biological samples such as hair collected in criminal investigations and missing 
person cases (Pospiech et al., 2022). They can also be used to exclude suspects. Prediction 
models for hair color, hair shape, hair loss, and hair graying have been developed from 
genome-wide association studies (GWAS) (Pospiech et al., 2022). Studies explored the 
genetic determinants of hair thickness and density traits and associated them with two 
genes, EDAR and FGFR2; these have been reported to be associated with thick hair 
in Asian populations (Pospiech et al., 2022). Two SNP loci found in EDAR 
(rs3827760 and rs365060) linked to hair thickness in Asians were also associated with 
head hair density in a Polish study with male participants (Pospiech et al., 2022). In addi- 
tion, two SNPs were significant in association with hair thickness (rs731236) and hairiness 
(rs2238136) (Pospiech et al., 2022). NGS can augment anthropological methods in investigations
(Pospiech et al., 2022). Accurate prediction of highly polygenic human
traits can be challenging, requiring a large number of typed SNPs or less strict significance
criteria (Pospiech et al., 2022). In a 2016 study, hair features were predicted using
GWAS data generated for Latin Americans; the highest accuracy was for hair color
and beard thickness, with approximately 40% of the phenotypic variation explained (Adhikari et
al., 2016). Further research with more populations is needed to develop more advanced 
statistical methods and predictive models for hair thickness and hair density traits 
(Pospiech et al., 2022). Identification of significant SNPs is more effective when corre- 
lations between DNA variants and the cumulative effects of SNPs are considered 
(Pospiech et al., 2022). Features including hair color, thickness, and shape can be pre- 
dicted using commercial NGS forensic DNA typing kits including the Verogen Fore- 
nSeq Signature Prep kit and Imagen kit (Elkins et al., 2023; Jager et al., 2017). 

NGS has been used to develop hybrid methods to generate nuclear genotype profiles 
from rootless hairs that could identify a suspect or victim (Lewis et al., 2022). A custom 
collection of oligonucleotide probes targeting 215 SNP loci for ancestry and identification
was prepared (Lewis et al., 2022). The quantity of DNA recovered per length of hair
varied between individuals and between hairs of the same individual (Lewis et al., 
2022). The recovered DNA was approximately 20 pg per cm of hair, and the SNPs ranged
from 21 to 244 reads per sample (Lewis et al., 2022). Using techniques originally developed
for ancient DNA typing, the SNP DNA profiles led to a high level of discrimination
between individuals (Lewis et al., 2022). The key to success was using sufficient
starting material or more than 2 cm of hair segments (Lewis et al., 2022). The substantial variation
between individuals in hair DNA quality and quantity presented a challenge in
generating useful genotype profiles for all the hair donors in the study (Lewis et al., 2022). 
Continued research in DNA extraction and amplification methods and the develop- 
ment of NGS tools will aid in DNA hair analysis. The impact of NGS on forensic evi- 
dence samples encouraged the United States Federal Bureau of Investigation (FBI) to 
conduct a validation study on the application of NGS for mtDNA casework within 
the FBI laboratory (Brandhagen et al., 2020). The method was both developmentally
and internally validated, and the studies were performed in accordance with the SWGDAM
Validation Guidelines for Forensic DNA Analysis Methods and the FBI’s Quality Assurance
Standards for Forensic DNA Testing Laboratories (Brandhagen et al., 2020). The FBI’s 
work demonstrated the efficiency, accuracy, reproducibility, sensitivity, and reliability 
of the PowerSeq CRM Nested System chemistry (Brandhagen et al., 2020). The Power- 
Seq assay demonstrated substantial advantages when it came to laboratory workflow 
(Brandhagen et al., 2020). The PowerSeq assay can be applied uniformly to all sample 
types without any changes to workflow or level of effort (Brandhagen et al., 2020). 
This assay simplified and improved the efficiency of the mtDNA analysis process (Brandhagen
et al., 2020). The validation results were submitted to the National DNA Index
System (NDIS) and were accepted and approved in May 2019 (Brandhagen et al., 2020).
The successful validation led to the approval of mtDNA control region data generated 
using the PowerSeq assay to be uploaded to the US Combined DNA Index System 
(CODIS) (Brandhagen et al., 2020). Sequencing only the control region is cheaper 
and compatible with traditional HVI/HVII mtDNA typing without adding the com- 
plexities that typing the complete mt genome data presents (Brandhagen et al., 2020). 


Conclusion 


Interest in the genetic analysis of hair is increasing with the more widespread availability 
of NGS tools. Extraction methods have demonstrated the ability to isolate mtDNA and
nuDNA from recent and ancient samples. As NGS continues to improve and advance,
future NGS implementation appears promising for augmenting the analysis of DNA 
for forensic hair evidence samples. As NGS methods continue to be approved for CODIS 
upload, adoption is expected to increase. 


Introduction 


The concept of DNA fingerprinting emerged in 1985 when Professor Sir Alec Jeffreys 
used a variable number of tandem repeats (VNTRs) to generate individual-specific 
DNA patterns—known as DNA fingerprints—that could be applied to both human 
identification and paternity testing (Jeffreys, Wilson et al., 1985). The VNTR technique 
was first described as an effective tool when it was used to help in immigration cases 
(Jeffreys, Brookfield et al., 1985) and to solve murder cases (Cobain, 2016). However, 
studies have highlighted a major limitation of the VNTR technique, which requires 
high-molecular-weight DNA. Therefore, DNA fragment analysis by the VNTR tech- 
nique is unsuitable for low-molecular-weight and degraded DNA. Poor DNA quantity 
and quality are two distinct phenomena that are commonly observed in forensic DNA 
analysis. 

Since the early 1990s, with the adoption of the polymerase chain reaction (PCR),
DNA analysis has taken an enormous step forward in its ability to generate genetic pro- 
files from small amounts of DNA. The PCR makes it possible to analyze the low 
amounts of DNA often retrieved from forensic samples. In addition, using short tandem 
repeat (STR) polymorphism and PCR has contributed to increased discrimination power 
between forensic DNA samples. Amplification of STRs has facilitated their application in 
analyzing short DNA fragments, such as degraded DNA. Consequently, STR analysis has 
replaced the VNTR method to become the standard technique used in the forensic 
laboratory workflow of human DNA identification (Edwards et al., 1991, 1992). The 
diversity and high discriminatory power of STR markers allow them to be used not 
only for human identification but also in population genetics studies and paternity cases. 
Despite their robustness, using STR markers has several limitations that make it 
challenging to obtain interpretable STR profiles. These limitations include their large 
amplicon size, high mutation rates [10⁻³ per locus per generation (Brinkmann et al.,
1998)], and artifacts such as stutter peaks. Therefore, single nucleotide polymorphism 
(SNP) typing is an alternative approach to address these shortcomings of STRs. 
Compared to STRs, SNPs possess lower mutation rates (10⁻⁸ per locus per generation)
(Durrett & Limic, 2001; Reich et al., 2002; Weber & Wong, 1993). 

Moreover, the small size of their PCR amplicons makes SNPs more useful for 
analyzing highly degraded and low-amount DNA samples than STR amplicons 
(Budowle, 2004), thus overcoming the major limitation of using STRs for highly 
degraded DNA samples (Sanchez et al., 2006). This advantage can be used in the forensic 
analysis of mass disasters and missing-person cases where DNA may be severely degraded. 
SNP markers also avoid one of the interpretational complexities associated with STR 
typing by generating DNA profiles without stutter peaks. In addition, several SNPs 
located within DNA coding or regulatory regions can lead to amino acid substitutions 
that may modify gene function properties, which, in turn, could result in the expression 
of distinct visible phenotypic traits. Recent evidence suggests that SNPs are the most 
prevalent type of genetic variation that significantly affects visible phenotypic traits. 
This provides a valuable opportunity for forensic investigations to potentially identify 
an unknown individual from a mere DNA trace, especially when a reference sample is 
unavailable for a DNA match. 

The development of technologies that could determine an individual’s genotype by 
scanning large numbers of SNPs has revolutionized the study of phenotypic traits in fo- 
rensics. Genome-wide association studies (GWAS) were used to identify SNPs associated 
with a particular phenotype or disease across the genome of large populations or cohorts 
of individuals by screening hundreds of thousands to millions of SNPs. Some variants 
showed a significant association with the phenotype of interest and were further validated 
and subjected to functional analysis to establish their biological significance. However, 
GWAS has limitations, including the potential for false-positive findings due to multiple 
testing, and population stratification. In multiple testing, multiple hypotheses are tested 
simultaneously, leading to an increased risk of observing significant results by chance 
alone. In population stratification, the subgroups have different allele frequencies due 
to differences in ancestry (Haidar et al., 2021), which can lead to spurious associations 
between SNPs and phenotypes if not adequately controlled (Price et al., 2006). Several 
improvements in the GWAS methodology have been developed to address these limi- 
tations, including using more sophisticated statistical algorithms, applying rigorous 
quality-control measures to eliminate unreliable SNPs, and incorporating genetic 
ancestry information to control for population stratification (Wray & Goddard, 2010). 
The GWAS approach has driven forensic geneticists to study SNP markers that are likely 
to be associated with physical traits to predict an individual’s phenotype. Recently, ge- 
netic profiling using SNP markers has been carried out using DNA sequencing technol- 
ogy, such as mini-sequencing, Sanger sequencing, and next-generation sequencing 
(NGS), the latter representing a milestone in human identification in forensic analysis. 
This chapter is dedicated to exploring the relationship between phenotypic- 
informative SNPs and externally visible traits. Additionally, it will delve into the use 
of NGS technology in testing these SNP-associated phenotypic inferences in forensic
DNA analysis. Expanding on these topics sheds light on their significance in forensic
research and analysis.


Progression to next-generation sequencing 


In the 1970s, two technologies were introduced to ‘read’ or sequence DNA: the dideoxy
chain termination method developed by Sanger, known as Sanger sequencing, and the
chemical modification method developed by Maxam and Gilbert, called Maxam-Gilbert
sequencing. Both methods originally used radioactive labels to infer the order of
nucleotides. Sanger sequencing involves denaturing the target DNA and annealing
it to a tagged primer. The DNA polymerase then extends the primer, incorporating 
chain-terminating nucleotides to create different-sized fragments (Sanger et al., 1977). 
In contrast, Maxam-Gilbert sequencing utilizes a terminally tagged double-stranded 
DNA fragment that is digested by a restriction enzyme. Different combinations of chem- 
icals are then used to create base-specific segments of DNA (Maxam & Gilbert, 1977). 
Both methods determine base order using electrophoresis to separate DNA fragments 
on polyacrylamide slab gels. 
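
The way base order is read from size-separated fragments can be sketched in a few lines.
This is a simplified illustration, not a model of any instrument: it assumes four
hypothetical reaction lanes (one per chain-terminating nucleotide) and fragment lengths
counted in bases added after the primer. The function name and the example data are
invented.

```python
def read_sanger_ladder(lanes: dict[str, list[int]]) -> str:
    """Reconstruct the synthesized strand from chain-terminated fragment lengths.

    `lanes` maps each terminating base to the lengths of the fragments that end
    in that base. Sorting all fragments by length gives the base order.
    """
    base_at_length: dict[int, str] = {}
    for base, lengths in lanes.items():
        for length in lengths:
            if length in base_at_length:
                raise ValueError(f"two fragments share length {length}")
            base_at_length[length] = base

    ordered = sorted(base_at_length)
    if ordered != list(range(1, len(ordered) + 1)):
        raise ValueError("ladder has a gap; the sequence cannot be read unambiguously")
    return "".join(base_at_length[n] for n in ordered)


# Hypothetical gel: each list holds the fragment lengths seen in that lane.
example_lanes = {"A": [2, 5], "C": [3], "G": [1, 6], "T": [4]}
sequence = read_sanger_ladder(example_lanes)
print(sequence)  # GACTAG
```

The string returned is the newly synthesized strand, which is complementary to the
template that was sequenced.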

By 1987, several enhancements had been made to the Sanger method, which mainly 
involved using a fluorescent dye instead of a radioactive label in a single reaction and 
developing fragment detection using capillary electrophoresis (CE) (Ansorge et al., 
1986; Smith et al., 1985). Subsequently, the first generation of automated 
fluorescence-based sequencing was developed by Applied Biosystems (ABI) using the 
Sanger technique (Hood et al., 1987; Smith et al., 1986), a leap that transformed
DNA sequencing technology. These improvements led
to substantial growth in sequencing data, from half a million bases in 1982 to around 
10 million bases by 1986 (Shendure et al., 2017). The improvement also included an in- 
crease in the number of capillaries in the sequencing instruments. The first capillary 
sequencer, ABI 310, was limited to one capillary, whereas the latest improved model, 
ABI 3730xl, has 96 capillaries. The latter sequencer can handle larger fragment sizes
and has a shorter run time (Guzvic, 2013). Such improvements significantly impacted 
the sequencing process of the Human Genome Project, the largest and most valuable 
project based on the Sanger sequencing technique. 

However, the Sanger method has major drawbacks, such as its high cost, low 
sequencing throughput, laborious operating procedures, and limitations in analyzing 
complex genomes (Fullwood et al., 2009; Metzker, 2010). These limitations have led 
to the development of new platforms that can determine an individual’s genotype by 
scanning large numbers of SNPs associated with phenotypic traits. The past few decades
have witnessed the introduction of massively parallel sequencing, also termed NGS, an
improved DNA sequencing approach that outperforms the traditional Sanger method in
terms of cost, throughput capacity, and sequencing efficiency. NGS is a high-throughput
sequencing technology that sequences DNA templates across the human genome in a
massively parallel manner to generate millions of DNA sequences (reads). The
introduction of NGS has allowed rapid progress in many biological fields,
such as clinical genetics (Krier et al., 2016; McCarthy et al., 2013), human history (Mar- 
ciniak & Perry, 2017), ancient DNA analysis (Poinar et al., 2006), microbiology (Motro 
& Moran-Gilad, 2017), genetics (Abecasis et al., 2012), and forensic science (Palencia- 
Madrid & de Pancorbo, 2015; Weber-Lehmann et al., 2014). NGS technologies are 
routinely used to sequence whole human genomes as well as target regions to permit 
the sequencing of a subset of the whole genome. Targeted NGS technologies are capable 
of analyzing large numbers of markers in a single run, including SNPs, STRs, insertion 
and deletion polymorphisms, structural variants, and copy number variants (Metzker, 
2010). 


Phenotypic-informative SNP in forensics 


Several studies have shown that SNPs are the most common type of genetic variation 
marker associated with certain physical traits of individuals that can serve as a forensic 
tool. Due to their low mutation rates, SNPs are inherited in stable patterns across gener- 
ations, making them suitable markers for studying ancestral relationships and for use in 
phenotypic prediction studies (Budowle & Van Daal, 2008). SNP markers that are asso- 
ciated with a particular physical trait can be used to predict a person’s physical appearance 
and are commonly referred to as phenotypic-informative SNPs. This prediction process is
known as forensic DNA phenotyping (FDP). It has been reported that FDP using
phenotypic-informative SNP markers can act as “biological eyewitnesses,” predicting 
certain crime suspect/perpetrator phenotypes and supporting human witnesses to identify 
an individual (Kayser, 2015; Valentine & Fitzgerald, 2016). Phenotype inference facili- 
tates the prediction of physical attributes including hair, eye, and skin pigmentation, in 
addition to facial morphology, all derived from DNA samples. This technology has 
several potential applications in forensic investigations, including identifying individuals 
and establishing their physical characteristics when traditional methods have proven 
ineffective. 

As previously discussed, NGS technology significantly facilitates the sequencing of 
large amounts of DNA spanning an individual’s genome. This includes regions that 
encapsulate data pertaining to the physical attributes of the individual. One of the pivotal 
methods employed for phenotype inference via NGS involves sequencing targeted 
phenotypic-informative SNPs associated with specific physical traits. This technique 
aims to identify an unknown individual by forecasting their physical characteristics 
derived from a DNA sample. This technology has contributed to narrowing down lists 
of suspects or victims and expediting investigative timelines. Furthermore, phenotype 
inference proves invaluable in the identification of human remains, particularly when
traditional identification methods like STR analysis are rendered infeasible due to factors 
such as decomposition or the unavailability of dental records or fingerprints. 


Externally visible characteristics 


As highlighted in previous sections, FDP is defined as predicting an individual’s externally
visible characteristics (EVCs) through the use of informative genetic SNPs. This
technique has been particularly instrumental for forensic applications, aiding in the iden- 
tification of both suspects and victims in criminal investigations. Nonetheless, it is essen- 
tial to note that this DNA intelligence approach functions primarily as an investigative 
tool as opposed to a method for identifying specific individuals (Kayser, 2015). This
distinction stems from its predominant use in excluding suspects whose appearance does
not correspond with the EVCs inferred from DNA samples procured from the crime
scene. Such technology undoubtedly bolsters law
enforcement efforts, particularly in criminal investigations and missing person cases 
involving unidentified individuals. Indeed, the FDP methodology has proven successful 
in identifying human remains and victims of disasters. Notable instances of successful FDP 
applications include the resolution of the Louisiana serial killer case (Fullwiley, 2008) and 
the Madrid train attacks (Phillips et al., 2007, 2009). Furthermore, the utility of FDP ex- 
tends beyond humans, finding application in breed classification and skeletal phenotype 
prediction based on canine DNA evidence (Raymond et al., 2022). 

Several studies have shown the association between specific phenotypic traits and 
their genetic determinants in forensics. However, several challenges must be resolved
before FDP can be routinely incorporated into investigative procedures. For example,
the accuracy and precision of current FDP tests are limited, and prediction accuracy
varies between EVCs. Traits for which the predictive accuracy of associated SNPs has
been examined include height (Aulchenko
et al., 2009; Liu et al., 2014), male pattern baldness (Redler et al., 2012; Richards 
et al., 2008), hair morphology (Fujimoto et al., 2008, 2009; Medland et al., 2009; 
Pospiech et al., 2018), and facial structure (Boehringer et al., 2011; Claes et al., 2014; 
Sheehan & Nachman, 2014). 

Physical traits such as hair and eye color and facial structure are complex because multiple
genes control them. They are also influenced by environmental factors such as
climate, ultraviolet (UV) light exposure, and nutrition. Consequently, their variation is
continuous, and DNA tests cannot account for the nongenetic components. Apart from
genetic heritability and environmental effects, several other factors can impact the
prediction accuracy of EVCs. These encompass the DNA selection and genotyping method,
the informativeness of the SNPs, and the reference dataset employed. Given these
intricacies, it is crucial to
address certain legal and ethical considerations when deploying NGS technology for 
phenotype inference. Such technology must adhere to legal and ethical principles to pro- 
tect an individual’s privacy and dignity. It is essential to avoid discriminatory practices, 
segregation, biased judgments, or ethnic persecution based on phenotype inference for 
specific population groups (Schneider et al., 2019). Governments should ensure that 
NGS-based phenotype inference is conducted in compliance with the due process of 
law and by forensic professionals who uphold rigorous standards. 

More research and collaborative efforts are needed for a comprehensive understand- 
ing of predictive SNPs associated with complex EVCs. This enhanced understanding 
stands to augment the accuracy and precision of FDP. Forensic DNA tests for analyzing 
DNA markers that cause or are associated with particular features, such as eye, hair, and 
skin color, have been developed and validated using suitable statistical predictive models 
(Boehringer et al., 2011; Budowle & Van Daal, 2008). Furthermore, pigmentation traits 
are among the least genetically complex of all EVCs, with a majority of the phenotypic 
information being conveyed by a modest number of genes. As a result, the prediction of 
traits associated with these features yields relatively accurate outcomes (Boehringer et al., 
2011; Budowle & Van Daal, 2008). The EVCs discussed in this chapter, including eye, 
hair, and skin color, along with facial features, represent the most extensively researched 
in the field of forensics. These EVCs are particularly important because they provide the
most informative leads when investigators are pursuing and identifying suspects.


Eye color 


Eye color is an easily observable yet genetically complex characteristic. The study of eye 
color genetics has a rich history, dating back to the early 1900s, when scientists first began 
examining the inheritance patterns of eye color (Davenport & Davenport, 1907). The 
color of an individual’s iris is determined by the production and distribution of pigment, 
which are controlled by multiple genes and factors. Understanding the genetics of eye 
color has significant implications, including identifying disease risks, developing person- 
alized medicines, and conducting forensic investigations. 

The OCA2 gene on Chromosome 15 is the most significant contributor to human 
eye color variation (Sturm & Larsson, 2009). This gene codes for a protein involved in 
melanin pigment formation and distribution in the iris. OCA2 SNPs are significantly 
associated with determining human eye color (Branicki et al., 2008; Duffy et al., 
2007). Further research using GWAS and linkage analysis has identified the HERC2 
gene, located near OCA2 and specifically linked to blue eyes (Kayser et al., 2008). In 
addition to blue, various eye colors, such as green, gray, and hazel, are observed in the 
human population. These colors are thought to result from the interplay of multiple ge- 
netic factors and the amount and type of melanin in the iris (Sturm et al., 2008). One 
variation in HERC2, rs12913832, is particularly linked to blue eyes, affecting OCA2
expression regulation. Individuals with two copies of the “G” allele of rs12913832
tend to have less melanin in their irises and are more likely to have blue eyes. This allele
is more common in individuals of European descent, which could explain the higher 
proportion of blue-eyed individuals in that population (Donnelly et al., 2012; Sturm 
et al., 2008). 

IrisPlex, the first highly sensitive multiplex genotyping assay for predicting blue/brown
eye color, marked a significant milestone in eye color prediction (Walsh et al., 2011,
2012). The IrisPlex system targets six SNPs that are involved in eye color determination
in the European population: rs12913832 (HERC2), rs1800407 (OCA2), rs12896399
(LOC105370627), rs16891982 (SLC45A2), rs1393350 (TYR), and rs12203592 (IRF4).
These SNPs were selected based on their association with specific eye colors and
allele frequencies in different populations. However, IrisPlex’s accuracy in predicting in- 
termediate eye colors is lower than for blue and brown eyes (Liu et al., 2009; Walsh et al., 
2012). Furthermore, its efficacy in predicting blue/brown eye color decreases when 
tested on more heterogeneous samples, such as those with Asian, African, and South 
American ancestry (Keating et al., 2013). The IrisPlex genetic test, initially developed 
based on a Dutch database, has shown varying degrees of predictive accuracy across 
different European subpopulations, with slightly lower effectiveness in individuals 
from Southern Europe than those from Western Europe (Dario et al., 2015; Purps 
et al., 2011; Salvoro et al., 2019). Therefore, the identification of the optimal SNPs 
for determining EVCs is markedly influenced by population-specific factors, indicating 
that the panels employed in predicting eye color should be tailored to specific geograph- 
ical regions. Future research directions in the field include refining prediction models, 
expanding them to include additional populations, and exploring the use of artificial in- 
telligence and machine learning to enhance eye color prediction accuracy. 
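
How SNP genotypes can be turned into eye color probabilities is sketched below. The
sketch assumes a multinomial logistic regression with brown as the reference category,
which is one common form for this kind of categorical prediction. Every coefficient is an
invented placeholder, not a published IrisPlex parameter, and only two of the six SNPs
are included to keep the example short. Genotypes are coded as the count (0, 1, or 2) of a
designated effect allele, which is also an assumption of this sketch.

```python
import math

# PLACEHOLDER coefficients, invented for illustration only (not IrisPlex values).
# Brown is the reference category, so its score is fixed at 0.
PLACEHOLDER_MODEL = {
    "blue": {"intercept": -3.0, "rs12913832": 2.5, "rs1800407": -0.5},
    "intermediate": {"intercept": -2.0, "rs12913832": 1.0, "rs1800407": 0.3},
}


def predict_eye_color(genotype: dict[str, int]) -> dict[str, float]:
    """Return category probabilities from effect-allele counts (0, 1, or 2 per SNP)."""
    scores = {"brown": 0.0}
    for category, weights in PLACEHOLDER_MODEL.items():
        score = weights["intercept"]
        for snp, allele_count in genotype.items():
            score += weights[snp] * allele_count
        scores[category] = score

    # Softmax over the three category scores.
    denominator = sum(math.exp(s) for s in scores.values())
    return {category: math.exp(s) / denominator for category, s in scores.items()}


two_copies = predict_eye_color({"rs12913832": 2, "rs1800407": 0})
no_copies = predict_eye_color({"rs12913832": 0, "rs1800407": 0})
print(max(two_copies, key=two_copies.get), max(no_copies, key=no_copies.get))
```

Because the coefficients are fitted to a reference dataset, a model of this form inherits
the population dependence discussed above: coefficients estimated in one population need
not transfer to another.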


Hair color 


Hair pigmentation is one of the most visible phenotypic traits in human appearance, with 
significant variation observed between individuals. The hair phenotype can be used in 
forensic contexts for identification purposes and has been a subject of interest to genetics 
researchers for years. The pigmentation of human hair involves various biological pro- 
cesses that determine its color, texture, and growth rate. At the core of hair pigmentation 
is melanin, a pigment produced in the melanocyte cells of the hair follicle. Melanin
synthesis follows two distinct pathways, producing eumelanin, which results in brown and
black hair, and pheomelanin, which determines red and yellow hair. The primary gene
responsible for determining hair color is the melanocortin 1 receptor gene (MC1R), which
regulates melanocyte production of melanin pigments. Studies have shown that variants of
MC1R have a significant impact on the expression of these pigments, and therefore on
hair color; for instance, a common MC1R mutation is the R151C variant, which
decreases the eumelanin pigmentation and increases the pheomelanin pigmentation,
resulting in blonde or red hair (Duffy et al., 2004). Interestingly, some genes that
determine hair pigmentation are also linked to skin pigmentation, suggesting a
relationship between these two characteristics. The SLC24A5 gene, for example, plays a
role in melanin production in melanocytes, and so it is associated with skin pigmentation
as well as lighter hair color (Lamason et al., 2005). Forensic scientists target specific
SNPs that correlate with hair color. These SNPs lie in genes including MC1R, TYR,
TYRP1, SLC24A5, SLC45A2, OCA2, and HERC2 (Valenzuela et al., 2010; van
Daal, 2008). MC1R variants are among the most common and significant markers used
to determine hair color. These variants have been found to influence the expression
of pheomelanin and eumelanin (i.e., the pigments responsible for red/yellow and
brown/black hair, respectively) (Rees, 2003).

A new forensic phenotyping assay, HIrisPlex, has been developed as an enhanced 
version of IrisPlex (Walsh et al., 2013). HIrisPlex comprises 24 forensically validated 
phenotypic markers in a single multiplex, including 18 new hair-color polymorphisms 
(10 MC1R SNPs, TUBB3, 2 SLC45A2 SNPs, KITLG, LOC105374875, SLC24A4, 
PIGU, and TYRP1), along with the 6 IrisPlex SNPs (Walsh et al., 2013). These markers 
have been successfully used several times as predictive tools for analyzing ancient skeletal 
remains (Chaitanya et al., 2017; Draus-Barini et al., 2013; Haeusler et al., 2016; King 
et al., 2014). HIrisPlex prediction has been evaluated for individuals from various
regions across Europe, with prediction accuracies of nearly 70% for blond hair, 80% for
brown, 80% for red, and 88% for black. To establish
its validity beyond European populations, Walsh et al. applied the HIrisPlex analysis
to DNA samples from the HGDP-CEPH human genome diversity cell line panel, 
consisting of global datasets representing 51 distinct ethnicities (Walsh et al., 2014). 
The study found that HIrisPlex eye and hair color probabilities can support inference, with
accuracy exceeding 86%, of whether a person with brown eyes and black hair originates
from non-European or European regions. This information
holds the potential to enhance forensic phenotyping in the future. 

One of the advantages of the HIrisPlex genotyping assay is that its sensitivity exceeds 
that of numerous other methods used in forensic settings because it can produce complete 
profiles down to a minimum DNA input of 63 pg. In addition, it has been demonstrated to
yield precise and trustworthy results when employed on simulated casework specimens 
composed of diverse biological materials, such as blood, semen, saliva, hair, and trace 
amounts of DNA (Walsh et al., 2014). This holds promise for use in identifying missing 
persons or victims of disasters when antemortem samples and possible relatives are not 
accessible. By using DNA-predicted characteristics, such as eye and hair color, such
individuals may be located and made available for STR profiling that can enable
their accurate identification. However, it is essential to consider the limitations of the
HIrisPlex genotyping assay. For instance, while the system has demonstrated high
accuracy in predicting hair and eye color, it may be less effective in populations with
less-studied genetic backgrounds. Furthermore, the accuracy of the predictions can be
influenced by various factors, such as environmental conditions, age, and individual
genetic variations. Therefore, future research should continue to refine and expand the
scope of the HIrisPlex assay to improve its accuracy and applicability in a broader range
of forensic contexts.


Skin color 


Skin color is a complex trait determined by various genetic factors and environmental 
influences. The primary determinant of skin color is the amount and distribution of 
melanin produced by melanocytes, which are specialized skin cells in the epidermis basal 
layer. Melanocytes produce melanin in specialized organelles called melanosomes, which 
are transferred to the surrounding skin cells to protect against UV radiation. The most 
significant genetic factor affecting skin color is the MC1R gene, which regulates the
production of eumelanin. Variations in the MC1R gene can result in different levels 
of eumelanin production, leading to various skin colors (Kanetsky et al., 2004). More- 
over, research has shown that variations in the HERC2 gene are associated with light 
and dark skin colors. Other genes that play a role in skin color include SLC24A5, 
SLC45A2, and TYR, which can initiate melanin production, and OCA2, which regu- 
lates the size and distribution of melanosomes (Andrade et al., 2017; Pospiech et al., 2014; 
Soejima & Koda, 2007). 

Additional genes, such as KIT ligand (KITLG) and agouti signaling protein (ASIP), 
also play significant roles in the variation of skin pigmentation. KITLG plays a crucial 
role in the development and migration of melanocytes, ultimately influencing skin color 
(Wang et al., 2021). Conversely, ASIP contributes to skin pigmentation by regulating the 
production of eumelanin and pheomelanin (Bonilla et al., 2005). These two forms of 
melanin are responsible for the coloration of skin, hair, and eyes. Environmental factors, 
such as exposure to sunlight and UV radiation, can also influence skin color. UV radia- 
tion stimulates melanocytes to produce more melanin, resulting in skin darkening or tan- 
ning as a protective measure against DNA damage and skin cancer (Jablonski & Chaplin, 
2010). Over time, populations living closer to the equator, where UV radiation is more 
intense, have developed darker skin, while populations living farther from the equator, 
where UV radiation is less intense, have developed lighter skin. 

A novel tool called HIrisPlex-S, which simultaneously predicts eye, hair, and skin co- 
lor, was developed as an advancement of the HIrisPlex system. HIrisPlex-S employs 36 
SNPs for making skin color predictions, incorporating 19 from the previous model and 
introducing 17 new markers, including ANKRD11, ASIP, BNC2, DEF8, 4 HERC2 
SNPs, MC1R, 4 OCA2 SNPs, RALY, SLC24A4, SLC24A5, and TYR. HIrisPlex-S
has demonstrated its ability to predict skin color on a worldwide scale, achieving high
accuracy rates in predicting dark, intermediate, pale, and very pale colors (Breslin 
et al., 2019). 


Face feature 


Facial morphology prediction is a process that has gained immense interest in various 
scientific fields, including genetics, anthropology, and forensics. As a tool with
significant potential in forensic and other investigative contexts, it can aid law
enforcement agencies in identifying suspects, reconstructing victims’ appearances, and
facilitating missing person investigations. However, using genetic markers to accurately
predict facial traits is much more complex than predicting eye, hair, and skin
pigmentation. This complexity is due to the contribution of multiple genes associated
with facial feature diversity. Therefore, accurate or reliable prediction of facial
morphology remains challenging. Over recent years, however, significant
advances in genomics have enabled the identification of key genes and genetic variants 
linked to facial feature development. Some SNPs within these genes have been analyzed 
for their impact on facial measurements, such as the distance between eyeballs, nose 
morphology, and mandible structure (Boehringer et al., 2011). 

These advances have primarily stemmed from integrating sequencing technologies 
and GWAS scanning. NGS has allowed for rapid and cost-effective sequencing of large 
populations, which in turn has facilitated the discovery of genetic variants linked to facial 
features. Coupled with GWAS data, scientists have developed computational algorithms, 
statistical models, and high-resolution three-dimensional imaging that greatly enhance 
the capacity to predict facial landmarks in great detail and with approximate accuracy 
(Shaffer et al., 2016). Initial breakthroughs in understanding facial morphology were 
achieved through two GWAS studies conducted on Europeans (Liu et al., 2012; 
Paternoster et al., 2012). They identified a significant association between the 
rs7559271 SNP, located within an intron of the PAX3 gene, and nasal bridge 
morphology. This association was also observed in other populations, highlighting 
PAX3’s significant role in facial development (Adhikari et al., 2016). Furthermore, 
four novel genetic variants close to the genes PRDM16, TP63, C5orf50, and
COL17A1 have been found to influence facial structure development (Liu et al.,
2012). These data imply that such genetic variants are important for face development
and contribute to the normal variation of human facial morphology.

Moreover, a study has investigated the genetics behind facial shape in individuals of
European ancestry and identified two novel genetic associations: SNP rs9456748 within
an intron of PARK2 with midface height and SNP rs72713618 within FREM1 with
central upper lip height (Lee et al., 2017). While the biological role of PARK2 in
craniofacial development remains unclear, evidence suggests that FREM1 may influence
facial variation. These findings highlight the potential of data-driven multivariate phenotyping
in facial morphology research. In terms of nose shape, a separate GWAS study reported 
five genetic variants located near the genes DCHS2, RUNX2, GLI3, PAX1, and EDAR 
(Adhikari et al., 2016). Each of these genes plays a role in various developmental processes 
that influence the formation and shape of the nose. For example, PAX1 contributes to 
skeletal development, potentially affecting nasal bone and cartilage, while EDAR affects 
ectodermal development, impacting the external part of the nose. 

There is strong evidence that the development of facial traits undergoes genetic
regulation to create an inherited, unique, and recognizable face, which plays an obvious role
in physical identification. For example, a study of the genetic basis of facial development in
African children and adolescents identified two genes, SCHIP1 and PDE8A, associated
with facial size, in addition to ten other loci linked to facial shape (Cole et al., 2016). 
Mouse models further supported these findings, in which genetic variants of SCHIP1 
and PDE8A were involved in mouse facial morphogenesis. This study highlighted 
differences from previous studies on individuals of European ancestry, suggesting a link 
between facial traits and population genetic structure and providing evidence that 
ancestry could affect facial morphology. Indeed, the importance of ancestry markers in 
determining facial features was demonstrated, and the researchers were able to predict 
facial feature models with a certain degree of reliability by constructing a face using these 
ancestry markers and then adding other SNPs (Claes et al., 2014, 2018). These models 
have successfully predicted specific facial features such as eyebrow width and the distance 
between the eyes. 

Additionally, a study on the Korean population identified five significant genetic
variants associated with facial morphology. For example, the OSR1-WDR35 locus
(rs7567283, G allele) was associated with the frontal facial contour; the HOXD1-MTX2
(rs970797, A allele) and WDR27 (rs3736712, C allele) loci were associated with
eye shape; and the SOX9 (rs2193054, C allele) and DHX35 (rs2206437, A allele)
loci were associated with nose shape, specifically nasal angle and subnasal width,
respectively. Compared with the collective data from the genome-wide project, this study
showed that allele frequencies of four SNPs associated with nose shape differ
among ethnic populations (Cha et al., 2018), suggesting that the shape of the nose is
genetically inherited.

The Snapshot FDP System is a DNA analysis tool developed by Parabon NanoLabs
that can predict facial morphology; sex; eye, hair, and skin color; freckling; and
biogeographic ancestry in a single analysis. The system uses DNA samples from crime scenes to
profile a potential suspect’s appearance. Interestingly, some law enforcement agencies 
have used the system to solve cases. Nevertheless, further research is needed to identify 
additional SNPs associated with facial features to improve the system’s accuracy level, 
which may vary depending on the quality and quantity of the DNA sample. 


Next-generation sequencing in forensics


PCR-CE-based analysis methods are widely used in forensic laboratories for typing
forensic genetic markers, such as STRs, SNPs, and recently introduced indel markers 
(Haidar et al., 2020; LaRue et al., 2012). STRs are the most discriminative and 
commonly used markers in forensics due to their multiallelic state. In contrast, most 
SNPs exist in a diallelic state, which limits their discriminatory power. It is estimated 
that 50 SNPs may be as informative as only 13 STRs, meaning that a single STR can 
match the identification power of 4 to 5 SNPs (Gill, 2001; Pakstis et al., 2010). Forensic
genetic markers, including STRs and SNPs, are found in coding and noncoding regions 
of the human genome. However, they require individual assays for each class of markers, 
as they cannot be typed together. This necessitates additional PCR assays and more of the 
often-limited DNA extract from forensic samples. Moreover, SNP markers have 
limitations in forensic analysis, such as requiring more DNA than STRs and more 
labor-intensive analysis methods, like Sanger sequencing, which is impractical for forensic 
analysts. Such analysis also requires a high level of expertise to perform and interpret, which might
not always be available in a forensic lab. 
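
The comparison between 50 SNPs and 13 STRs can be illustrated with a rough calculation
of the probability that two unrelated people share the same genotype. The sketch below
rests on strong simplifying assumptions that are ours, not the cited authors': Hardy-Weinberg
proportions, independent loci, every SNP with two alleles at frequency 0.5, and every
STR with ten equally frequent alleles. Real markers have unequal allele frequencies, so
the figures are only indicative.

```python
from itertools import combinations_with_replacement


def match_probability(allele_freqs: list[float]) -> float:
    """Chance that two unrelated individuals share a genotype at one locus.

    Under Hardy-Weinberg proportions this is the sum of squared genotype frequencies.
    """
    total = 0.0
    for i, j in combinations_with_replacement(range(len(allele_freqs)), 2):
        if i == j:
            genotype_freq = allele_freqs[i] ** 2                  # homozygote
        else:
            genotype_freq = 2 * allele_freqs[i] * allele_freqs[j]  # heterozygote
        total += genotype_freq ** 2
    return total


# Idealized loci (assumed values, not real population data).
snp_match = match_probability([0.5, 0.5])
str_match = match_probability([0.1] * 10)

# Independent loci multiply (the product rule).
combined_snps = snp_match ** 50
combined_strs = str_match ** 13

print(f"Per-locus match probability: SNP {snp_match:.3f}, STR {str_match:.3f}")
print(f"50 SNPs: {combined_snps:.1e}   13 STRs: {combined_strs:.1e}")
```

With these idealized inputs, the two panels give combined match probabilities within
about an order of magnitude of each other, which is in line with the estimate quoted above.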

NGS technology has advanced forensic practice by allowing for the coamplification 
and individual sequencing of hundreds of SNPs, in addition to other classes of forensic 
genetic markers, in a single assay. This saves time, especially when additional DNA 
investigations are needed. In addition, unlike the CE method, which only measures
the length of the overall PCR products and can result in partial profiles, NGS provides
additional information by reading the sequence of the targeted STR alleles and their
accompanying stutter products. This ability to coamplify hundreds of markers of different
types in a single run marks a new era of high-resolution sequencing in forensic applications. In
addition, it enables simultaneous analysis of a combination of STRs and different SNP 
classes, such as biogeographical ancestry SNPs (associated with a person’s geographical 
origin) and phenotypic SNPs of EVCs, in a single sequencing run. This substantially in- 
creases the sequencing data’s overall informativeness and forensic testing’s discriminatory 
power while decreasing match probabilities by several orders of magnitude. 
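
The difference between length-based and sequence-based STR typing can be shown with a
toy example. The two alleles below are invented for illustration and do not correspond to
any real locus: both contain ten four-base repeat units, but one carries a single variant
unit. The function names are likewise hypothetical.

```python
MOTIF_LENGTH = 4  # assumed tetranucleotide repeat

# Two hypothetical alleles of identical length; the second has one variant unit (TCTG).
allele_a = "TCTA" * 10
allele_b = "TCTA" * 4 + "TCTG" + "TCTA" * 5


def length_based_call(allele: str) -> int:
    """CE-style call: the allele is named only by its number of repeat units."""
    return len(allele) // MOTIF_LENGTH


def sequence_based_call(allele: str) -> tuple[int, str]:
    """Sequencing-style call: repeat count plus the full allele sequence."""
    return length_based_call(allele), allele


ce_calls = {length_based_call(a) for a in (allele_a, allele_b)}
ngs_calls = {sequence_based_call(a) for a in (allele_a, allele_b)}

print(len(ce_calls), len(ngs_calls))  # 1 2
```

Length-based typing reports a single allele here, whereas sequence-based typing
distinguishes two, which is the source of the additional discriminatory power.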

One significant advantage of sequencing technologies is the ability to utilize haplo- 
types, consisting of groups of closely linked SNPs on a single chromosome inherited as 
a single unit. This approach enables a more comprehensive understanding of genetic 
information and relationships. In the context of forensic science, microhaplotypes, also 
known as microhaps, have emerged as valuable tools for human identification. These 
multi-SNP haplotypes comprise two or more closely linked SNPs separated by less 
than 200 base pairs (Kidd et al., 2018; Oldoni et al., 2019). Microhaplotypes have gained 
recent attention from the forensics community as promising alternatives to separately 
genotyping individual SNPs and STRs for human identification. Microhaplotypes allow 
forensic scientists to obtain more information from a DNA sample, which leads to 
increased accuracy and efficiency in human identification efforts. However, it is essential 
to note that not all informative-phenotypic SNPs may form phenotypic microhaps. 
Therefore, accurate phenotype prediction requires genotyping a large number of indi- 
vidual SNPs (Kidd et al., 2014, 2018; Oldoni et al., 2019). Despite this limitation, the 
ability to analyze both haplotypes and microhaplotypes in conjunction with other genetic 
markers further demonstrates the versatility and utility of sequencing technologies in 
forensic applications and potentially generates more information from a smaller amount 
of starting material compared to some traditional methods. 

NGS technologies have become the new standard in the field, with several platforms 
developed to cater to different sequencing chemistries, data outputs, and applications. 
Examples of these platforms include Illumina, Ion Torrent, Pacific Biosciences, and 
Oxford Nanopore Technologies. The choice of the NGS platform depends on factors 
such as read length, data output, sequencing accuracy, cost, and turnaround time. In 
forensic applications, NGS has been utilized to simultaneously sequence various markers, 
such as STRs and SNPs, enabling a more comprehensive analysis of DNA samples. The 
Ion Torrent platform, developed by Thermo Fisher Scientific, is another widely used 
NGS technology that relies on a unique sequencing-by-synthesis approach. Unlike 
the fluorescence-based detection utilized by Illumina, Ion Torrent measures the release 
of hydrogen ions as nucleotides are incorporated into the growing DNA strand. This 
method allows for real-time detection of the nucleotide sequence through 
semiconductor-based sensors. The Ion Torrent platform is known for its rapid 
sequencing, lower cost, and simplified library preparation compared to other NGS 
platforms. 

For forensic applications, Thermo Fisher Scientific has developed the Precision ID 
NGS System for human identification, which includes the Ion AmpliSeq DNA panels 
and the Ion S5 sequencing platform. The Precision ID NGS System is designed to 
streamline the process of DNA profiling and enable simultaneous analysis of a wide range 
of markers, including autosomal STRs, Y-STRs, X-STRs, and SNPs. The Ion AmpliSeq 
DNA panels are a collection of ready-to-use primer sets that target specific regions of in- 
terest in forensic genomics. These panels facilitate multiplex PCR amplification of 
numerous markers in a single reaction, saving time and reducing the required input 
DNA. The panels available in the Precision ID System include the GlobalFiler NGS 
STR Panel, the Precision ID Identity Panel, the Precision ID Ancestry Panel, and the 
Precision ID mtDNA Whole Genome Panel, each designed for specific forensic 
applications. 

The MiSeq FGx forensic genomics system, developed by Illumina and now part of 
Verogen, is a groundbreaking NGS platform designed for forensic genomic applications. 
The system leverages the well-established Illumina sequencing technology, which uses a 
sequencing-by-synthesis approach with fluorescently labeled nucleotides. This approach 
involves the incorporation of dNTPs during the DNA polymerization process. As each dNTP 
is added to the growing 
DNA strand, it emits a unique fluorescent signal, which is captured using high-resolution 
imaging to determine the nucleotide sequence accurately. The MiSeq FGx system offers 
several advantages: higher throughput, improved sensitivity, and the ability to simulta- 
neously analyze multiple markers, including STRs, SNPs, and other genomic regions 
of interest. These features make the platform particularly suitable for forensic genomics, 
where obtaining comprehensive and accurate data from challenging samples is often 
crucial. 

Verogen offers a suite of forensic kits designed to work with the MiSeq FGx system, 
providing targeted and streamlined solutions for various forensic applications. Essential 
Verogen forensic kits include the ForenSeq DNA Signature Prep Kit, which enables 
simultaneous, high-resolution sequencing of 231 markers, comprising 153 identity 
markers (amelogenin, the sex chromosome marker, 27 autosomal STRs, 24 Y-STRs, 
7 X-STRs, and 94 identity SNPs), 22 phenotypic SNPs, and 56 ancestry SNPs. In addi- 
tion, the ForenSeq Kintelligence Kit targets kinship and intelligence markers, providing 
information on biogeographical ancestry, EVCs, and genetic relatedness by analyzing 165 
markers, including 94 autosomal SNPs, 27 Y-SNPs, 22 X-SNPs, and 22 phenotypic 
SNPs. Finally, the ForenSeq mtDNA Control Region Kit targets the hypervariable re- 
gions of the mitochondrial DNA (mtDNA) control region, offering high-resolution 
sequencing for human identification and maternal lineage analysis, optimized for chal- 
lenging samples, and providing robust coverage of the mtDNA control region. 

A considerable amount of literature on the ForenSeq kit has been published, assessing 
the forensic performance of both the kit and the sequencing system, including tests on 
sensitivity, reproducibility, concordance, and mixture analysis (Almalki et al., 2017; Iozzi 
et al., 2015; Kocher et al., 2018; Silvia et al., 2017; Xavier & Parson, 2017). Other studies 
have focused on the applications of the ForenSeq kit in forensic DNA analysis, including 
its performance on degraded DNA (Calafell et al., 2016; Zhang et al., 2018), kinship 
testing (Kostrzewa et al., 2017; Ma et al., 2016), and ancestry prediction (King et al., 
2018; Ramani et al., 2017). To date, several studies have tested the performance of 
the ForenSeq markers on different populations (Casals et al., 2017; Churchill et al., 
2017; Devesse et al., 2018; Kim et al., 2018; Novroski et al., 2016; Wendt et al., 
2016). These studies have demonstrated that sequence variants can be detected in the 
STRs and their flanking regions (indel or SNP variants). In addition, SNPs in the flanking 
regions can provide additional allelic diversity compared to CE-based methods. 
This increased number of genetic variants detected in forensic-marker regions using 
NGS enhances the power of this kit as a forensic DNA tool. 

Forensic laboratories face considerable challenges in implementing NGS technology 
due to its labor-intensive workflow and the significant expense associated with 
sequencing, including the cost of equipment and kits. However, as the sequencing 
cost has dramatically dropped and various NGS systems have become available, the 
NGS system is expected to become more straightforward and affordable in the coming 
years. This shift will undoubtedly accelerate the advancement of forensic genomic 
applications. 

The integration of phenotype prediction with DNA profiling amplifies the investiga- 
tive potential of forensic genetics. Identifying phenotypic markers through NGS-based 
SNP analysis has become valuable in forensic science and criminal investigations. While 
the identification of markers linked to external physical characteristics, such as hair, skin, 
and eye color, along with facial features, has garnered substantial attention, it is imperative 
to acknowledge that numerous other SNPs could provide equally beneficial information. 
Incorporating these additional SNPs into a single NGS kit could provide more compre- 
hensive genetic information for identification purposes, leading to more accurate and 
efficient forensic investigations and contributing to solving the most challenging criminal 
cases in the future. A deliberate and cautious approach toward incorporating new pheno- 
typic markers can help maximize the benefits of genetic information in forensic science. 
Further exploration of the underlying genetic basis of complex physical traits, along with 
the development and validation of additional markers, may enhance the prediction 
accuracy of FDP in investigations. 

Introduction 


When a biological sample is recovered from a crime scene, obtaining the genetic profile 
of its donor using autosomal short tandem repeat (STR) markers on a capillary 
electrophoresis (CE) platform is the standard approach implemented in investigative 
forensic laboratories. It was introduced in forensic casework in 1994 and since then it 
has been considered as the “gold standard” of genetic identification. This technique 
allowed the progress of forensic DNA analysis and the subsequent creation of national 
DNA databases. The first database was launched in 1995 by the British Home Office, 
and in 1997, the Austrian DNA database of the Ministry of the Interior became effective 
(Parson, 2018). The USA’s National DNA Index System (NDIS) was implemented soon 
after, in 1998. 

The main goal of STR profiling is to determine if a person could be the donor of a 
sample recovered from the crime scene. This is done by comparing the genetic profiles of 
the query sample and reference samples from victims and suspects. If there is no match, 
the unidentified profile can be compared with the ones included in DNA databases. 
Similarly, in missing person or mass disaster cases, the genetic profiles obtained from 
the victims can be compared to profiles from missing persons’ relatives or reference sam- 
ples collected from victims’ items. However, in any of these situations, the failure to iden- 
tify the unknown profile may slow down or cease the investigative process. If there are no 
additional suspects or leads, the crime could turn into a cold case. 

Therefore, traditional DNA profiling, although extremely valuable, may not be 
immediately helpful in specific situations (Kayser & Schneider, 2009). Due to occasional 
limitations in the use of STR profiles, other approaches for the use of DNA for forensic 
analysis have begun to be investigated. 

Forensic DNA Phenotyping 


In the late 2000s, a new and promising area of study emerged: Forensic DNA Phenotyp- 
ing (FDP). This methodology aims to predict externally visible characteristics (EVCs) of 
an individual based solely on their DNA, obtained from the biological material left 
behind at the crime scene, or from unknown bodies, for example. Thus, this technique 
would provide additional leads that could help the police to continue investigating the 
case or narrow down the list of suspects (Kayser, 2015). 

The nomenclature “Forensic DNA Phenotyping” first appeared in 2008, in an article 
published by Koops and Schellekens (2008) that discussed the regulatory issues of the 
FDP technique in forensic casework, i.e., to what extent it should be allowed. However, 
the idea of inferring physical traits of an individual for helping police investigations 
already existed, under the name “DNA-based prediction.” Therefore, studies aiming 
to understand the molecular basis of some EVCs were already being conducted. Such 
studies were carried out searching for associations between a given physical trait and 
SNPs in genes that could be involved in the development of that phenotype, through 
gene-based or Genome-Wide Association Studies (GWAS). Usually, GWAS are limited to 
variation sites genotyped through array chips. In the following sections, we will discuss 
the advantages brought by Next-Generation Sequencing (NGS) to FDP analysis. 


An SNP-based tool 


A positive aspect of FDP is that appearance traits are inferred mostly through Single 
Nucleotide Polymorphisms (SNPs) analysis. In forensic casework, SNPs present some ad- 
vantages over STRs. Since a SNP is defined by the change of only one nucleotide in the 
DNA sequence, very short PCR amplicons are necessary for its accurate genotyping (e.g., 
50 nucleotides or less). In forensic casework, it is not unusual that the sample recovered is 
in low amounts and/or degraded, rendering it impossible to genotype STRs due to the 
presence of large alleles composed of many repeats of a given sequence. Therefore, the 
short amplicon size required for SNP analysis increases the chance of a successful analysis 
of the recovered DNA. Moreover, the alleles of a SNP are easily designated because SNPs 
lack the repetitive structure that can produce artifacts such as stutter peaks, which 
sometimes challenge STR analyses (Kayser & De Knijff, 2011). 

The main disadvantage of SNPs for human identification purposes is that they are 
mostly biallelic and therefore less polymorphic than multiallelic STRs. Thus, they 
present limited heterozygosity and are less informative (Wendt & Novroski, 2019). 
Despite that, this type of marker is abundant and is spread throughout the whole genome. 
The 1000 Genomes Project has already identified more than 80 million SNPs with minor 
allele frequency greater than 1%. They also estimated that the differences between a 
typical and a reference human genome can vary from ~3.5 million to ~4.3 million 
SNPs (The 1000 Genomes Consortium, 2015). Other studies have concluded that the 
analysis of 50-100 autosomal SNPs can provide match probabilities comparable to 
the 10-15 STRs used for forensic purposes (Pakstis et al., 2010; Sanchez et al., 2006). 
Nowadays, the multiplex analysis of large amounts of SNPs is not a problem anymore 
due to the NGS technology, also known as massively parallel sequencing (MPS). This 
methodology has improved considerably over the past decade. While its cost and hands-on 
time have decreased, its sensitivity and ease of use have increased. All these aspects 
together are facilitating the access of NGS technology to genetics laboratories worldwide. 
A survey conducted in 105 laboratories (from police, governmental, academic, and pri- 
vate institutions) of 32 European countries revealed that 46% of the participants already 
owned an MPS platform, and 26.7% have ordered it or were planning to do so within the 
next 2 years. Since data were collected during 2019, it is expected that, nowadays, 73% of 
the labs surveyed have an operational MPS platform. MiSeq/MiSeq FGx (Illumina/ 
Verogen) and Ion PGM/Ion S5 (Thermo Fisher Scientific) are the most used platforms 
in those labs (Gross et al., 2021). 
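
To illustrate why several dozen biallelic SNPs are needed to rival a small set of multiallelic STRs, the following minimal sketch computes a random match probability under Hardy-Weinberg equilibrium, assuming independent loci. The allele frequencies are hypothetical (every SNP at 0.5/0.5 and every STR with eight equally frequent alleles); they are chosen only for illustration and are not taken from the cited studies.

```python
from math import prod

def locus_match_probability(allele_freqs):
    """Probability that two unrelated individuals share a genotype at one locus,
    assuming Hardy-Weinberg equilibrium (sum of squared genotype frequencies)."""
    n = len(allele_freqs)
    total = 0.0
    for i in range(n):
        for j in range(i, n):
            if i == j:
                genotype_freq = allele_freqs[i] ** 2                   # homozygote
            else:
                genotype_freq = 2 * allele_freqs[i] * allele_freqs[j]  # heterozygote
            total += genotype_freq ** 2
    return total

def profile_match_probability(loci):
    """Product rule over loci, assuming the loci are independent (unlinked)."""
    return prod(locus_match_probability(freqs) for freqs in loci)

# Hypothetical panels, for illustration only.
snp_locus = [0.5, 0.5]        # biallelic SNP
str_locus = [1 / 8] * 8       # STR with eight equally frequent alleles
snp_panel = [snp_locus] * 50
str_panel = [str_locus] * 13

print(f"50 SNPs: {profile_match_probability(snp_panel):.2e}")
print(f"13 STRs: {profile_match_probability(str_panel):.2e}")
```

Under these assumptions a single SNP is far less discriminating than a single STR, but the product over 50 SNPs reaches the same order of magnitude as 13 STRs. Real panels have unequal allele frequencies, so actual values will differ.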


For prediction of complex traits 


All EVCs are considered polygenic and complex traits, which means that other factors 
(internal or external) besides the DNA sequence influence human physical traits. Skin co- 
lor, for example, is primarily determined by the type and amount of melanin present in its 
tissue, which is mainly regulated by the genetic background of an individual. As an 
external factor influencing skin pigmentation, we can cite the sunlight effect: skin color 
can change if it is exposed to sunlight (and has some degree of tanning ability), becoming 
darker. As an internal factor, hormonal differences between males and females, or even 
between individuals of the same sex, may influence the skin pigmentation outcome 
(Slominski et al., 2004). Thus, this phenotype may vary significantly irrespective of 
the genotypic background (Serre et al., 2018). 

The same scenario holds in different ways for other EVCs, such as hair color. Some 
people experience a darkening of this trait as they pass from childhood to adulthood 
(e.g., light blond hair as a child and dark blond/light brown hair as an adult). The 
molecular mechanism that explains this phenomenon is not fully understood; however, 
studies have shown that changes in the morphology of a melanosome with age (e.g., its 
enlargement), which can be caused by modified expression of genes involved in 
melanogenesis, can be related to hair color changes (Commo et al., 2012; Itou, 2018). Some 
EVCs are less complicated than others; as the complexity of a trait increases, the difficulty 
of predicting it accurately increases as well. 


Human pigmentation as a starting point 


When the development of the FDP area began, the molecular basis of many EVCs was 
already being studied; however, human pigmentation was the trait whose genetic 
basis was best understood at the time. It is considered a less intricate trait when 
compared to other physical traits. Pigmentation is primarily determined by the presence 
of melanin in the epidermis, iris, and hair. As many diseases or inflammatory processes are 
related to changes in pigmentation, the melanin biosynthesis pathway was already under 
scrutiny before the emergence of FDP. In 2007, the first GWAS on human pigmentation 
traits was published. Afterward, many other GWAS for eye, hair, and skin color were 
performed. Nowadays, it is estimated that more than 100 genes are involved in human 
pigmentation, and HERC2, OCA2, TYR, MC1R, SLC45A2, SLC24A5, and TYRP1 
genes have already been identified as important contributors to this trait (Liu et al., 
2013; Pavan & Sturm, 2019). 

In 2011, the first tested and validated FDP tool was developed for eye color predic- 
tion: the IrisPlex system, based on six highly informative SNPs from six genes. It classifies 
eye color into three categories: blue, intermediate, and brown. The tool relies on a 
multinomial logistic regression (MLR) algorithm to predict blue and brown eyes with over 90% 
sensitivity (Walsh, Liu, et al., 2011). Later, in 2013, the HIrisPlex system was launched, 
including 24 markers (23 SNPs and 1 indel) for eye and hair color simultaneous predic- 
tion (Walsh et al., 2013). It consists of the well-known IrisPlex system in addition to 
another MLR model for hair color prediction based on 22 markers (4 SNPs are used 
in both models). Hair color is classified into four categories: blond, brown, black, and 
red. HIrisPlex’s sensitivity ranges from 33.3% for black to 66.5% for blond hair. Then, 
in 2018, an MLR model based on 36 SNPs for skin color prediction was developed 
and combined with the other two predictive models, conceiving the HIrisPlex-S system 
(Chaitanya et al., 2018; Walsh et al., 2017). Comprising 41 markers, this tool was pro- 
posed for the simultaneous prediction of eye, hair, and skin color. Skin color categories 
were determined according to the Fitzpatrick scale, which considers the skin tone and the 
ability to tan. There are five categories: very pale, pale, intermediate, dark, and dark-to- 
black skin. Its sensitivity for skin color ranges from 7.8% for very pale skin to 81.0% for 
dark-to-black skin. The three predictive models are freely available online and all their 
quality metrics are displayed on the tool’s website (https://hirisplex.erasmusmc.nl/). 
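
As an illustration of how an MLR classifier of this kind turns SNP genotypes into category probabilities, the sketch below implements a multinomial logistic regression with brown as the reference category. The SNP names, intercepts, and coefficients are hypothetical placeholders and are not the published IrisPlex parameters; only the structure of the calculation is intended to be informative.

```python
import math

# Hypothetical parameters for illustration only (NOT the published IrisPlex values).
# Each SNP is coded as the number of copies (0, 1, or 2) of its minor allele.
INTERCEPTS = {"blue": -1.0, "intermediate": -1.5}
BETAS = {
    "blue":         {"snp_a": 2.0, "snp_b": 0.5},
    "intermediate": {"snp_a": 0.8, "snp_b": 0.3},
}

def predict_eye_color(genotype):
    """Return category probabilities from a multinomial logistic model.

    genotype: dict mapping SNP name -> minor-allele count (0, 1, or 2).
    Brown is the reference category, so its linear predictor is fixed at 0.
    """
    scores = {"brown": 0.0}
    for category, alpha in INTERCEPTS.items():
        scores[category] = alpha + sum(
            BETAS[category][snp] * count for snp, count in genotype.items()
        )
    # Softmax, shifted by the maximum score for numerical stability.
    top = max(scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exp_scores.values())
    return {c: v / total for c, v in exp_scores.items()}

probabilities = predict_eye_color({"snp_a": 2, "snp_b": 1})
predicted = max(probabilities, key=probabilities.get)
print(predicted, probabilities)
```

In practice, a prediction is usually reported only when the highest probability exceeds a chosen threshold; the appropriate threshold is a validation decision and is not fixed by this sketch.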

Despite being currently the most well-established tool for eye color prediction, 
IrisPlex presents a critical limitation since it struggles with intermediate eyes (i.e., non- 
blue/non-brown colors). This fact is recognized by its developers, and the sensitivity dis- 
played at the HIrisPlex-S official website for this trait is only 0.1%. The model has been 
tested in populations of different ancestry backgrounds, and the results are consistent: the 
tool produced almost no intermediate eye color predictions (Balanovska et al., 2020; Carratto 
et al., 2021; Dario et al., 2015; Dembinski & Picard, 2014; Kastelic et al., 2013; 
Meyer et al., 2021; Palmal et al., 2021; Pneuman et al., 2012; Salvoro et al., 2019; 
Yun et al., 2014). However, when the MLR prediction model is adjusted based on 
the allele frequencies of the tested population, intermediate eye color sensitivity increases, 
ranging from 10% to 65% (Dario et al., 2015; Dembinski & Picard, 2014; Meyer et al., 
2021; Palmal et al., 2021; Salvoro et al., 2019). 

There are fewer studies evaluating hair and skin prediction models. HIrisPlex-S, 
when tested on 6500 Latin Americans from the CANDELA consortium, presented AUC 
(Area Under the Receiver Operating Characteristic Curve) values ranging from 65% 
for brown to 90% for blond hair, and from 71% for intermediate to 78% for light 
skin. The results for hair are close to the ones obtained for a Brazilian population sample: 
AUC values ranged from 70% for brown to 88% for blond hair (Carratto et al., 2021). 
When tested in North Eurasian populations, AUC values between 76% (brown hair) 
and 92% (black hair) were observed (Balanovska et al., 2020). 
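
The two performance metrics quoted throughout this section, per-category sensitivity and AUC, can be computed as in the minimal sketch below. The labels and probabilities are toy data invented for illustration, and the AUC is calculated one category against all others using the rank-based (Mann-Whitney) formulation.

```python
def sensitivity(true_labels, predicted_labels, category):
    """Fraction of individuals truly in `category` that were predicted as such."""
    positives = sum(1 for t in true_labels if t == category)
    true_positives = sum(
        1 for t, p in zip(true_labels, predicted_labels)
        if t == category and p == category
    )
    return true_positives / positives if positives else float("nan")

def auc(true_labels, category_probabilities, category):
    """One-vs-rest AUC: probability that a random member of `category` receives
    a higher predicted probability than a random non-member (ties count 0.5)."""
    pos = [p for t, p in zip(true_labels, category_probabilities) if t == category]
    neg = [p for t, p in zip(true_labels, category_probabilities) if t != category]
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Toy data, for illustration only.
true_hair = ["blond", "blond", "brown", "brown", "black"]
pred_hair = ["blond", "brown", "brown", "brown", "black"]
p_blond = [0.90, 0.40, 0.50, 0.10, 0.05]   # predicted probability of "blond"

print(sensitivity(true_hair, pred_hair, "blond"))
print(auc(true_hair, p_blond, "blond"))
```

Sensitivity depends on the final category call, whereas AUC summarizes how well the predicted probabilities rank individuals, which is why a model can show a high AUC and a low sensitivity for the same category.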

Another set of predictive tools is hosted at the Snipper app suite 3 website (http:// 
mathgene.usc.es/snipper/). This portal was designed by the Bioinformatics Group of the 
Faculty of Mathematics of the University of Santiago de Compostela and hosts different 
classification tools for human populations. For eye color prediction, the user can choose 
between a set of 7, 13, or 23 SNPs, and three classification algorithms: naive-Bayes 
classifier, MLR, or genetic distance. Eye color can be classified into three categories: 
blue, green/hazel, and brown (Ruiz et al., 2013). For hair color prediction, a set of 12 
SNPs and the same three classification algorithms are available for classifying the phenotype 
as blond, brown, black, or red (Söchtig et al., 2015). For skin color prediction, the user 
can also choose among those three classification algorithms to classify the phenotype as white, 
intermediate, or black, using 10 SNPs (Maroñas et al., 2014). Like the HIrisPlex-S system, 
the Snipper tools are user-friendly and available online at no cost. 
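
A naive Bayes classifier of the general kind mentioned above can be sketched as follows. This is not Snipper's implementation: the SNP names, genotype frequencies, and equal priors are hypothetical, and the sketch only shows how per-SNP likelihoods are multiplied under the assumption that SNPs are independent within each phenotype class.

```python
import math

# Hypothetical P(genotype | eye color) tables for two SNPs; illustration only.
LIKELIHOODS = {
    "blue": {
        "snp1": {"AA": 0.05, "AG": 0.15, "GG": 0.80},
        "snp2": {"CC": 0.60, "CT": 0.30, "TT": 0.10},
    },
    "green/hazel": {
        "snp1": {"AA": 0.20, "AG": 0.50, "GG": 0.30},
        "snp2": {"CC": 0.40, "CT": 0.40, "TT": 0.20},
    },
    "brown": {
        "snp1": {"AA": 0.70, "AG": 0.25, "GG": 0.05},
        "snp2": {"CC": 0.20, "CT": 0.40, "TT": 0.40},
    },
}
PRIORS = {"blue": 1 / 3, "green/hazel": 1 / 3, "brown": 1 / 3}  # assumed equal

def naive_bayes_posterior(genotypes):
    """Posterior probability of each class given a dict of SNP -> genotype."""
    log_post = {}
    for cls, prior in PRIORS.items():
        log_p = math.log(prior)
        for snp, genotype in genotypes.items():
            log_p += math.log(LIKELIHOODS[cls][snp][genotype])
        log_post[cls] = log_p
    # Normalize in log space to avoid underflow with many SNPs.
    top = max(log_post.values())
    unnormalized = {c: math.exp(v - top) for c, v in log_post.items()}
    total = sum(unnormalized.values())
    return {c: v / total for c, v in unnormalized.items()}

posterior = naive_bayes_posterior({"snp1": "GG", "snp2": "CC"})
print(max(posterior, key=posterior.get), posterior)
```

Working with log probabilities keeps the calculation stable when the number of SNPs grows, since the product of many small likelihoods would otherwise underflow.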

HIrisPlex-S and Snipper are the best known tools, but there are other predictive 
models in the literature, like the one proposed by Hart et al. for eye and skin color 
prediction, relying on genotypes of 8 SNPs (Hart et al., 2013), and the models proposed 
by Allwood and Harbison (2013), which use three and four SNPs to classify eye color as 
blue, intermediate, or brown. More recently, further sets were developed. The EyeColour 11 
(EC11) SNP set was proposed by Meyer et al. for eye color prediction in Scandinavians 
(Meyer et al., 2021). Palmal et al. (2021) proposed the CAN set, including 56, 101, 
and 120 SNPs for eye, hair, and skin color prediction, respectively. This set was designed 
based on the GWAS data from over 6500 Latin Americans of the CANDELA consortium. 

Of all models cited above, the HIrisPlex-S system is the only one with a validated low 
throughput assay. Along with IrisPlex evolution, multiplex assays for genotyping the 
required markers were also designed, based on single base extension (SBE) with SNaPshot 
and CE. The first assay for the 6 IrisPlex SNPs showed great sensitivity, producing 
complete profiles from 31 pg of input DNA. This is extremely important for forensic 
casework since samples are often found degraded or in low amounts (Walsh, Linden- 
bergh, et al., 2011). When HIrisPlex was launched a few years later, a new multiplex 
assay was designed containing its 24 markers. The minimum amount of DNA required 
for analysis was set at 63 pg, which continues to be an adequate quantity for forensic 
casework (Walsh et al., 2014). Due to the limited multiplex capacity of SNaPshot 
(around 30 markers), the 41 markers of the most recent HIrisPlex-S system had to be split 
into two batches. Therefore, Chaitanya et al. maintained the previous HIrisPlex assay 
with 24 markers and developed an additional one with the 17 remaining SNPs. The 
new assay also preserved the sensitivity of 63 pg to produce complete genotypes. 
Thus, the HIrisPlex-S system comprises two multiplex assays, each one requiring 
63 pg of DNA, and three statistical prediction models for eye, hair, and skin color infer- 
ence (Chaitanya et al., 2018). 

HIrisPlex-S multiplex assays were developed based on the SNaPshot and CE technologies 
with a view to their use in police investigations, since basically every investigative 
laboratory of forensic genetics owns these instruments. However, the fact that the SNaPshot 
genotyping assay cannot handle a large number of DNA variants is an important disad- 
vantage for the FDP technique. Since this area is still in development, new predictive 
markers are still being identified and, therefore, the existing tools will probably be 
updated in the future to include such markers. Using SNaPshot/CE technologies, 
more multiplex assays would have to be designed to include new markers. It would 
considerably increase the amount of DNA required, as well as the time, cost, and effort 
employed in the analysis. In addition, it is noteworthy that FDP analysis would usually be 
carried out only after the traditional STR analysis; therefore the amount of DNA left for 
FDP analysis would be decreased. Also, because of its semi-quantitative dye-based signals, 
the CE methodology presents limitations for SNP analysis of DNA mixtures, which are 
common in rape cases, for example. 


Considerations regarding the applicability of NGS to forensic DNA 
phenotyping 

The use of NGS technology overcomes the previously mentioned limitations. It has an 
increased multiplex capacity, allowing the analysis of thousands of markers in a single 
assay. Thus, envisaging the transition from CE to MPS, the HIrisPlex-S developer group 
proposed a protocol to generate and analyze its 41 markers from raw MPS data (Breslin et 
al., 2019). They also performed the forensic developmental validation of the MPS assay 
on two platforms commonly used in forensics: Ion Torrent and MiSeq. Samples simulating 
casework were obtained from blood and other sources, such as saliva, hair bulbs, 
semen, and low quantity touch DNA. Using commercial control samples in serial con- 
centrations (5, 10, 25, 50, 100, 250, 500 pg, and 1 ng), the Ion Torrent MPS assay was 
able to produce complete genetic profiles from 100 pg of DNA, while the MiSeq MPS 
assay required 250 pg of DNA. This input difference may be due to the unequal number 
of samples used in each run. While MiSeq sequenced 96 samples with one cartridge, Ion 
Torrent sequenced the 96 samples split into two chips with 48 samples each. Therefore, if 
a reduced number of samples were analyzed in MiSeq, the sensitivity achieved for this 
platform could have been higher. In fact, except for touch DNA, both platforms yielded 
complete profiles for all sources simulating casework from 100 pg of DNA. Although this 
minimum amount of required DNA is higher than the amount required for the two 
HIrisPlex-S SNaPshot assays (63 pg), in the MPS platform all the markers can be analyzed 
in a single run, consuming DNA only once. Currently, the 24 HIrisPlex markers for 
simultaneous eye and hair color prediction are included in the ForenSeq DNA Signature Prep Kit 
(Verogen) for preparing sample libraries to be sequenced on MiSeq (Illumina) and MiSeq 
FGx (Verogen) platforms. HIrisPlex markers also compose the Ion AmpliSeq DNA 
Phenotyping Panel, which is compatible with the Applied Biosystems Precision ID 
sequencing workflow and can be sequenced on Thermo Fisher platforms. 

In the same study, a protocol to interpret a mixture of two samples from NGS data 
was proposed. Forty mixtures were simulated and an average of 28 HIrisPlex-S markers 
could be separated into two profiles. An evaluation of sensitivity for artificially degraded 
samples prepared for sequencing on the MiSeq platform was performed. 500 pg of DNA 
was exposed to Ultraviolet (UV) radiation through different time windows. Complete 
HIrisPlex-S genetic profiles were obtained even after 10 min of UV light exposure, 
establishing MPS platforms as a robust methodology for forensic purposes. 

In spite of the importance of HIrisPlex-S in the FDP area, its accuracy can still be 
improved, even for Europeans. In such populations, pigmentation traits may show 
some degree of correlation, such as the common cooccurrence of blue eyes and light 
skin, or dark skin and brown eyes. A study with individuals from Europe and the United 
States estimated and added the correlation between pigmentation phenotypes to the 
HIrisPlex-S prediction model. They observed that: (a) the accuracy of the system 
increased, and (b) the 41 DNA variants could explain only a fraction of this correlation. 
Therefore, they concluded that more independent predictive markers need to be identified 
and that adding them to the currently established predictive tools would increase their 
accuracy (Chen et al., 2021). Pigmentation traits (as well as other EVCs) are determined by 
variation sites of major effects, in addition to variation sites of minor effects. Thus, work is 
needed to continue unraveling the genetic basis of human pigmentation. Recently, 50 and 
111 new variants influencing eye and hair color, respectively, were identified in two large 
GWAS conducted on thousands of Europeans (Hysi et al., 2018; Simcoe et al., 2021). 

Moreover, to ensure the global performance of predictive tools, conducting GWAS 
in non-European populations is extremely important, with particular attention 
to admixed populations. GWAS conducted on African populations have revealed novel 
aspects of their genetic architecture (Crawford et al., 2017; Martin et al., 2017). 
Regarding admixed populations, Latin Americans present a complex admixture pattern 
involving Native Americans, Europeans, and sub-Saharan Africans: Amerindian ancestry 
is stronger in Mexico, Guatemala, Peru, and Ecuador, while Cuba, Chile, Colombia, 
Puerto Rico, Venezuela, Argentina, Brazil, and Uruguay have a substantial contribution 
from Europeans, and countries in the Caribbean area (such as Haiti and Jamaica) display 
higher levels of African ancestry (Norris et al., 2018; Rodrigues-Soares et al., 2020; 
Ruiz-Linares et al., 2014; Salzano & Sans, 2014). Due to the admixed background, non- 
European alleles (or combinations of them) derived from different biogeographical an- 
cestries may be influencing pigmentation in such populations as well. For example, recent 
studies have identified associations between the MFSD12 region and skin color in East 
Asians and Native Americans, supporting the theory that light skin has evolved indepen- 
dently in West and East Eurasia (Adhikari et al., 2019; Lona-Durazo et al., 2019). 

Massively Parallel Sequencing assays designed to explore the genetic diversity of 
pigmentation genes, notably in non-European populations, may play a role of utmost 
importance in identifying less common variation sites with minor effects that are hardly 
identified by array-based GWAS. Moreover, the need to increase the number of independent 
markers, coupled with the need to introduce non-European variation sites 
into forthcoming FDP tools in order to achieve more accurate eye, hair, and skin color 
predictions, may restrict their use to Next-Generation Sequencing assays. This 
conclusion may be easily extended to all polygenic EVCs. 


Other EVCs as FDP targets 


We have discussed only pigmentation traits so far because they already have established 
predictive tools that were tested in many different populations, highlighting their 
strengths as well as the weaknesses that must be addressed. However, there are many other 
promising EVCs currently being studied. 

It has been demonstrated that many loci associated with human pigmentation are also 
associated with the presence of freckles (Eriksson et al., 2010; Sulem et al., 2007, 2008), 
and some predictive models have already been developed. The first one was published in 
2018 and used SNPs from four genes (MC1R, IRF4, ASIP, and BNC2) to classify indi- 
viduals as freckled or nonfreckled, achieving an accuracy of 74.13% (Hernando et al., 
2018). Another study developed two models based on 17 and 19 SNPs: a binary (i.e., 
freckled/nonfreckled) and a multinomial (non/medium/heavily freckled) one. The 
models also considered gender and interaction between SNPs as predictive variables. 
The binary model achieved a sensitivity of 84.0%, while the sensitivity of the multinomial 
was 54.5%, 74.3%, and 17.2% for non, medium, and heavily freckled individuals, respec- 
tively (Kukla-Bartoszek et al., 2019). 

Body height is a characteristic that has been studied since the beginning of the FDP 
era, because it is commonly measured in many cohort studies for different purposes and, 
therefore, this information was promptly available. It has been estimated that heritability 
for human height is around 80% (Liu et al., 2019). However, height is a complex and 
continuous trait, and it is estimated that hundreds or thousands of SNPs are required 
for an accurate prediction. Since the exact stature is very hard to predict, some models 
have been developed by converting stature into a discrete variable using binary classifications
(i.e., tall/nontall). Three GWAS were conducted by the Genetic Investigation
of Anthropometric Traits (GIANT) consortium (Allen et al., 2010; Wood et al., 2014; Yengo et al.,
2018). The first one identified 180 SNPs associated with height, while the second study
highlighted 697 new SNPs and the third, 512 new SNPs. Liu et al. investigated the predictive
power of the first two sets in Europeans and achieved AUC values of 0.75 and
0.79, respectively (Liu et al., 2019). Lello et al. (2018) conducted a study for height prediction
in Europeans using approximately 20,000 SNPs and obtained a correlation of 0.65
between the predicted and the actual height of individuals.
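
Since most of the models discussed in this section are compared through their AUC values, a minimal sketch of how an AUC can be computed for a binary classifier (e.g., tall/nontall) may be helpful. The function name and the predicted probabilities below are illustrative assumptions and do not come from any of the cited studies. The AUC is taken here as the probability that a randomly chosen positive individual receives a higher score than a randomly chosen negative one.

```python
def auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical predicted probabilities of being "tall" (illustrative values only)
tall = [0.9, 0.7, 0.6, 0.4]      # individuals who are actually tall
nontall = [0.8, 0.5, 0.3, 0.2]   # individuals who are actually nontall

auc_value = auc(tall, nontall)
print(auc_value)
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, which helps to put values such as 0.75 or 0.66 into perspective.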

The eyebrow color is highly correlated with scalp hair color; however, this correla- 
tion is not perfect and, therefore, eyebrow and scalp hair color may differ. This suggests 
that, in spite of some variants influencing both traits, there are variants unique for 
eyebrow color. A GWAS conducted in 2019 has identified a new locus associated 
with eyebrow color (C10orf11) and five other loci that are known for their associations 
with scalp hair color. From these six loci, the authors developed a model using 25 SNPs 
for eyebrow color prediction and obtained AUC values ranging from 62.0% for brown to 
70.1% for blond eyebrows (Peng, Zhu, et al., 2019). 

Genes involved in the molecular basis of hair shape have been highlighted during the 
past decade. The first two genes reported as associated with this trait were EDAR and 
TCHH. While EDAR has a major effect on explaining straight hair in East Asians, the 
TCHH gene plays a major role in Europeans. GATA3, PRSS53, FRAS1, and 
WNT10A genes have shown associations with hair shape as well (Adhikari et al., 
2016; Pospiech et al., 2015). Liu et al. developed a prediction model for straight versus 
non-straight hair using 14 SNPs (Liu et al., 2018), which achieved AUCs of 0.66 and 
0.64 in two different cohorts (individuals with North-Western European ancestry 
from Australia and North-Western Europeans from the Netherlands). After that, 
Pospiech et al. developed two models, a binary one, using 32 SNPs to predict straight
versus non-straight hair, and a multinomial one, using 33 SNPs to predict straight versus 
wavy versus curly hair (Pospiech et al., 2015). The binary model presented an AUC of 
0.68, while the multinomial model presented AUCs of 0.68, 0.60, and 0.62 for straight, 
wavy, and curly hair, respectively. 

Still on the subject of hair, the SNPs most strongly associated with male
hair loss are located in the AR/EDA2R region, on chromosome X. In addition, more
than 250 loci related to this condition have been identified so far. Some predictive
models have been developed, with sensitivities of around 73% to 84% (Hagenaars et al.,
2017; Liu et al., 2016; Pirastu et al., 2017). 

Predicting the facial features of an individual can be considered the ultimate goal of 
forensic DNA phenotyping. This task is a big challenge since the human face has many 
different characteristics that must be taken into account. Furthermore, there are many 
different ways in which the face can be subdivided when looking for associations. In 
the past decade, some independent GWAS for facial shape, most of them focusing on 
Europeans, have highlighted different associated SNPs. Ten genes were identified in 
more than one study: CACNA2D3, DCHS2, EPHB3, HOXD, PAX1, PAX3, PKDCC,
SOX9, SUPT3H, and TBX15. A recent GWAS study looking for associations between 
markers and facial distances in Europeans established 13 facial points and measured the 
distance between them. They identified 24 genetic loci (17 of them were novel) and 
then validated these findings in additional multiethnic populations. Ten loci were suc- 
cessfully replicated, six of them being novel (Xiong et al., 2019). 

Taken together, the performance parameters observed for all these EVCs highlight 
the fact that the discovery of SNPs from array-based GWAS studies is of limited extent. 
Given the polygenic nature of these EVCs, massively parallel sequencing association- 
based studies focusing on pertinent gene panels and involving populations from multiple 
ethnic backgrounds are mandatory for real progress in this field. Such studies would not
only allow the identification of population-specific variation sites of major effect and mi- 
nor effect variation sites of worldwide relevance, but also a proper assessment of 
population-specific linkage disequilibrium within the pertinent genes in order to assist 
with the selection of independent markers for FDP purposes. 

Although biogeographic ancestry (BGA) is not a morphological trait, it is frequently 
considered under the FDP scope. So far, many ancestry informative markers (AIMs) 
have been identified, and many sets for BGA inference have been developed. The 
SNPforID 34-plex forensic ancestry test is one of the most established sets. It includes 34
SNPs and is focused on the differentiation of sub-Saharan Africans, Europeans, and East 
Asians (Fondevila et al., 2013). The Snipper app suite 3 website hosts an online tool 
that allows the use of the 34-plex set. To fill the gap of Native American ancestry, the 
PIMA (Population Informative Multiplex for the Americas) set was published in 2020. 
It could be considered as a complement of the previous 34-plex, and comprises 26 
AIM-SNPs, focusing on differentiating American ancestry from other continental popu- 
lations (Carvalho Gontijo et al., 2020). Currently, there are other established BGA infer- 
ence sets, such as Seldin’s 128 AIMs and the Kidd lab set of 55 AIMs (Kidd et al., 2014; 
Kosoy et al., 2009). 


DNA methylation and age prediction 


Age cannot be predicted directly from the DNA sequence, since the sequence barely
changes over time; epigenetic marks, in contrast, do change throughout a lifetime.
In this realm, DNA methylation markers have been suggested as the most informative
tool for age prediction (Freire-Aradas et al., 2020). DNA methylation is essential for
silencing retroviral elements, regulating tissue-specific gene expression, genomic 
imprinting, and X chromosome inactivation (Moore et al., 2013). It involves the attachment
of a methyl (-CH3) group to the fifth carbon (C5) of a cytosine linked to a guanine
through a phosphate group (CpG). CpG sites are usually methylated and relatively rare
across the genome (Deaton & Bird, 2011), except for specific locations called CpG islands 
(CGIs). CGIs correspond to regions predominantly unmethylated with high CG density 
(>55% CG), mainly located at the gene promoter regions (Antequera & Bird, 1993; Bird
et al., 1985). 

The presence of 5-methylcytosine (5mC) in CpG islands represses gene expression
by regulating chromatin structure and preventing transcription factors from binding to 
a promoter region. Since these regions are associated with the control of gene expression, 
CpG islands could exhibit a tissue-specific pattern of DNA methylation (Moore et al., 
2013). However, CpG islands associated with transcription start sites rarely show 
tissue-specific methylation patterns (Meissner et al., 2008), while regions located as far 
as 2 kb from CpG islands, called CpG island shores, present evolutionarily conserved patterns
of tissue-specific methylation (Irizarry et al., 2009).

Methylation patterns are heritable and transmitted with high fidelity during DNA 
replication (Jones & Liang, 2009). DNA methyltransferase 1 (DNMT1) maintains the
DNA methylation pattern in a cell lineage (Hermann et al., 2004). Passive or active 
mechanisms can remove these DNA methylation patterns. The passive DNA demethy- 
lation occurs in the absence of the DNA methylation maintenance machinery. Active 
DNA demethylation depends on the activity of proteins from the ten-eleven translocation
(TET) family: TET1, TET2, and TET3 (Ito et al., 2010). DNMT3A and DNMT3B
enzymes, complexed with histone deacetylases (HDACs) and Histone H3 lysine 9 
methylation (H3K9me), are responsible for de novo methylation, which occurs in sites 
with no previous indication of methylation (Epsztejn-Litman et al., 2008). 

5mC corresponds to approximately 1.5% of genomic DNA (Lister et al., 2009). It is
highest in the embryo and decreases gradually over time by a process that includes de 
novo methylation of CpG islands bound to polycomb repressors and demethylation of 
other genomic sites. Despite tissue-specific rates and speed, these methylation changes 
occur in all tissues over time (Dor & Cedar, 2018; Maegawa et al., 2010). Thus, by 
measuring the methylation status derived from a DNA sample of a given tissue, it is 
possible to predict the individual’s age (Horvath, 2013). 

Methylation patterns have been widely used to predict age, since they are altered 
throughout life and are essential for the aging process. Generally speaking, age-associated
epigenetic markers comprise locus-specific hypermethylation and global hypomethylation
(Xiao et al., 2016). Moreover, the 5mC levels are tissue-specific, a factor
that justifies the importance of assessing the most suitable predictors for each tissue 
(Horvath, 2013; Parson, 2018). Although many statistical models have been proposed 
for predicting age in different tissues, most age prediction studies employ blood or buccal 
samples (Bocklandt et al., 2011; Eipel et al., 2016; Freire-Aradas et al., 2016; Hong et al., 
2017; Park & Kim, 2016; Vidaki et al., 2017; Zbieć-Piekarska et al., 2015).

Bocklandt et al. (2011) developed the first epigenetic age predictor model based on 88 
CpG sites from saliva samples and achieved an average accuracy of 5.2 years. Many studies 
have suggested age prediction models based on CpG sites located in the ELOVL2 gene 
(Freire-Aradas et al., 2016; Hamano et al., 2016; Hannum et al., 2013; Park & Kim, 
2016; Peng, Feng, et al., 2019; Zbieć-Piekarska et al., 2015; Zubakov et al., 2016). Hannum
et al. (2013) provided a high accuracy chronological age estimator based on 71 CpG
sites, showing a 96% correlation with age and a median absolute deviation of 3.9 years. 
Horvath (2013) studied 8000 samples covering 51 tissues and identified 353 age- 
associated CpG sites, with a 96% age correlation and 3.6 years average error. Horvath 
et al. (2018) developed epigenetic clocks to predict age in specific cell types such as 
skin, endothelial cells, and buccal cells and made available an online DNAm age calcu- 
lator (https://dnamage.clockfoundation.org/calculator). 

Another freely available predictive tool is also noteworthy. It is hosted at the previously
mentioned Snipper app suite 3 website (http://mathgene.usc.es/snipper/age_models.php)
and uses seven markers to infer age (Freire-Aradas et al., 2016), with a prediction
error of approximately 3 years. This is the most suitable tool for use in forensic
casework, since it involves a limited number of methylation markers that can be easily
evaluated by Sanger sequencing after bisulfite conversion (as explained below). However,
the identification of new methylation markers associated with age is still necessary.

There are three main groups of techniques for identifying regions that
are differentially methylated: bisulfite conversion-based methods, restriction enzyme-based
approaches, and affinity enrichment-based assays (Pajares et al., 2021). The restriction
enzyme-based methods take advantage of the differential digestion properties of
methylation-sensitive restriction enzymes (MSREs), and the affinity enrichment-based
methods employ either methyl-CpG-binding domain (MBD) proteins or antibodies specific
for 5mC (as in MeDIP) to enrich methylated DNA regions. Notwithstanding,
bisulfite conversion methods are the standard approach to analyzing methylation patterns. 
The technique takes advantage of differential deamination kinetics of unmethylated and 
methylated cytosines. Before PCR amplification, the unmethylated cytosines will be 
deaminated to uracil, while methylated cytosine remains unaffected (Parson, 2018). 
Then, these methylation patterns will be analyzed by downstream techniques, such as 
sequencing-based approaches (Li & Tollefsbol, 2021) or methylation-specific PCR 
(Herman et al., 1996). The Illumina Infinium HumanMethylation450 (450K) BeadChip
array-based platform is one of the most used methods for characterization of DNA 
methylation after bisulfite conversion (Solomon et al., 2018). The Illumina Beadchip 
array provides a user-friendly and cost-effective alternative for methylation pattern anal- 
ysis; however, there are associated biases, such as the inclusion of probes that interrogate 
only previously identified CpG sites (Yong et al., 2016). 
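
The logic of bisulfite conversion described above can be illustrated with a small in silico sketch. It assumes an idealized, complete conversion and looks only at the top strand as it would be read after PCR (uracil is read as thymine). The function name, the toy sequence, and the methylated position are hypothetical and serve only as an illustration.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Return the top-strand read-out after bisulfite treatment and PCR.

    Unmethylated cytosines are deaminated to uracil and read as T;
    cytosines whose 0-based index is in methylated_positions remain C.
    Complete conversion is assumed (an idealization).
    """
    converted = []
    for index, base in enumerate(sequence.upper()):
        if base == "C" and index not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


# Toy example: only the cytosine at index 1 (within a CpG) is methylated
original = "ACGTCCGA"
converted = bisulfite_convert(original, {1})
print(original, "->", converted)
```

Comparing the converted read with the reference sequence then reveals which cytosines were protected by methylation, which is the information exploited by the downstream techniques.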

Therefore, NGS methodology empowers the study of epigenetic variation, offering 
high resolution data and a more detailed view of its modifications (Meaburn & Schulz, 
2012). These technologies allow the interrogation of genomes without prior knowledge 
of sequence (Barros-Silva et al., 2018). Therefore, second- and third-generation sequencing
methods are helpful for exploratory analysis of bisulfite-converted DNA. One of
the greatest advantages of NGS approaches is that precise ratios of methylation may be
easily established by observing the presence of hundreds or thousands of reads per locus
in different cell types. Moreover, these high-throughput deep sequencing methods are 
fast, reliable, produce terabytes of data, are relatively accessible for genome-wide analysis, 
and provide resolution at single base-pairs (Laird, 2010; Soto et al., 2016; van den Oord 
et al., 2013). This single-base resolution may be of paramount importance for the iden- 
tification of new age-related methylation sites. 
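
As a minimal sketch of how a methylation ratio can be derived from read counts at a single CpG site after bisulfite sequencing, the fraction of reads retaining a cytosine can be taken as the methylation level. The function name and the read counts below are illustrative assumptions, not values from any cited study, and real pipelines would also account for sequencing errors and incomplete conversion.

```python
def methylation_level(c_reads, t_reads):
    """Methylation ratio at one CpG: reads with C (methylated) over all informative reads."""
    total = c_reads + t_reads
    if total == 0:
        raise ValueError("no reads cover this CpG site")
    return c_reads / total


# Hypothetical coverage of 1000 reads at one CpG site
level = methylation_level(c_reads=730, t_reads=270)
print(f"methylation level: {level:.2f}")
```

With hundreds or thousands of reads per locus, the sampling uncertainty of this ratio becomes small, which is why deep sequencing allows precise methylation estimates.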

The forensic field has made few attempts at implementing MPS in DNA methylation
analysis for age prediction purposes. Vidaki et al. employed a combined method
based on machine learning and NGS to generate a model for age prediction from blood 
samples. Based on identifying the methylation status of 16 CpG sites, they proposed a 
model achieving an age correlation of 0.86 and a mean absolute error (MAE) of 7.45 
years (Vidaki et al., 2017). Naue et al. presented an age-prediction tool based on MPS 
and a random forest machine learning algorithm. Using whole blood samples, they pro- 
posed a model based on four genes (ELOVL2, F5, KLF14, and TRIM59) with a MAE of 
3.2 years and a root-mean-square error (RMSE) of 4.19 years (Naue et al., 2017).
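
Because the age prediction models in this section are compared by their MAE and RMSE, a short sketch of how both metrics are computed may be useful. The ages below are invented for illustration and are not taken from any of the cited studies.

```python
import math


def mae(actual, predicted):
    """Mean absolute error: average of the absolute prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root-mean-square error: square root of the average squared error."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Hypothetical chronological and predicted ages, in years
actual_ages = [20, 35, 50, 65]
predicted_ages = [23, 33, 54, 60]

mae_value = mae(actual_ages, predicted_ages)
rmse_value = rmse(actual_ages, predicted_ages)
print(f"MAE = {mae_value:.2f} years, RMSE = {rmse_value:.2f} years")
```

The RMSE is never smaller than the MAE, because squaring gives more weight to large errors; a wide gap between the two suggests that a few individuals are predicted particularly poorly.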

A model for predicting chronological age combining MPS and machine learning 
methods was proposed from 12 CpG sites previously selected by Vidaki et al. The model 
showed an RMSE of 4.9 years with an MAE of 4.1 years (Aliferi et al., 2018). Using the 
MiSeq FGx instrument (Verogen on Illumina technology), Heidegger et al. introduced 
the VISAGE primary prototype tool for age estimation aiming at 32 CpG sites from five 
genes (ELOVL2, MIR29B2C, FHL2, TRIM59, and KLF14). They observed a mean- 
standard deviation of 1.4% across ratios for age prediction estimates on blood samples 
(Heidegger et al., 2020). Following this, Heidegger et al. evaluated the methylation 
pattern in 13 CpG sites from semen samples. The proposed tool was independently vali- 
dated by five laboratories from the Visible Attributes Through Genomics (VISAGE) 
project consortium, with the assay performance showing an MAE of 5.1 years (Heidegger
et al., 2022). Considering blood, buccal swab, and bone samples, Wozniak et al. proposed
three age prediction models comprising the VISAGE Consortium’s enhanced tool for
epigenetic age estimation. They achieved a model based on six CpGs from six genes 
with an MAE of 3.2 years for blood, a model for buccal cells using five CpGs from 
five genes yielding an MAE of 3.7 years, and a model employing six CpGs from four 
genes that showed an MAE of 3.4 years for bone samples (Wozniak et al., 2021). 

Although NGS approaches are expected to be widely used in the future, this is not yet 
the method of choice in the case of small-scale projects requiring low-resolution 
sequencing information regarding methylation, especially due to the complex data anal- 
ysis and higher cost of data generation (Barros-Silva et al., 2018; Rauluseviciute et al., 
2019; Wreczycka et al., 2017). 


The VISAGE initiative and the future of FDP


In 2017, the Visible Attributes Through Genomics (VISAGE) Consortium was estab- 
lished. It was composed of 13 participants from academic, police, and justice institutions 
of eight European countries, and from different areas of expertise (forensic geneticists, 
forensic DNA practitioners, statistical geneticists, and social scientists). The project aimed 
to further explore FDP technology, conducting more studies about markers associated 
with appearance, age, and ancestry, and using them to develop new predictive tools. 
They also focused on the statistical interpretation of data, and how this methodology 
could be used by investigative authorities. The legal, ethical, and social aspects concern- 
ing the implementation of the FDP technique in forensic casework were also discussed. 

The project formally came to an end in 2021 and, so far, more than 20 scientific papers
covering all the topics mentioned above have been published. The Consortium
has developed two types of predictive tools: a basic and an enhanced one. They were 
developed aiming at sequencing assays performed on NGS platforms, given that the 
simultaneous analysis of hundreds of markers will be needed for their successful applica- 
tion. Separate tools were developed for (a) appearance traits (AT) and BGA, and (b) age 
prediction, since samples must be prepared in different ways for the latter purpose. 

The basic tool for AT and BGA prediction uses the 41 HIrisPlex-S markers for eye, 
hair, and skin color inference and 115 ancestry informative markers (AIMs) for BGA inference,
comprising 153 markers (three of them overlap for BGA and AT prediction). This tool has 
been validated on two different platforms: (a) using the PowerSeq Chemistry on the MiSeq
FGx System and (b) in a single multiplex reaction using the AmpliSeq design pipeline for
the Ion S5 MPS platform. In general, complete genotypic profiles were obtained from as 
little as 100-125 pg of DNA (Palencia-Madrid et al., 2020; Xavier et al., 2020). The 153
markers are covered by the Ion AmpliSeq VISAGE-Basic Tool Research Panel, suitable 
for sequencing on ThermoFisher platforms. The enhanced tool for AT and BGA predic- 
tion includes 524 markers for predicting eye, hair, skin, and eyebrow color, hair 
morphology, male baldness, and presence of freckles. BGA inference is performed using 
autosomal, X, and Y-linked SNPs, as well as microhaplotypes. This tool was validated 
on the Ion S5 platform and tested in five different laboratories, emerging as a robust, sensitive,
and reproducible assay for forensic purposes (Xavier et al., 2022).

The VISAGE Consortium has also developed tools for age estimation, as already 
mentioned in the previous topic. The basic tool is only suitable for blood samples and 
has been validated using bisulfite conversion and posterior sequencing on the MiSeq 
FGx platform (Heidegger et al., 2020). The enhanced tool for age estimation performs 
prediction from three different tissues (blood, buccal cells, and bones), using specific 
markers and statistical models for each tissue. It was validated on the MiSeq FGx platform 
(Wozniak et al., 2021). A model for age estimation from semen has also been developed 
and validated (Heidegger et al., 2022; Pisarek et al., 2021). 


Forensic DNA phenotyping in the next-generation sequencing era 


The progress made in the FDP area since it started is huge. However, according to 
everything we have discussed so far, there is still a lot of work to be done. Even though 
genes that play a major role in the understanding of the genetic basis of EVCs have been 
highlighted, there is still a need to identify markers that play minor roles to achieve an 
ideal predictive power. Given that EVCs are complex polygenic traits, it is important 
to capture the variability of all the genes involved. As we commented before, GWAS 
are usually conducted based on array methodologies. A possible approach is to calculate 
polygenic scores for appearance traits using thousands/millions of markers of an array 
panel. However, even in such cases, many important genetic variants for a specific trait 
could be missing in the array chip. NGS technology has enabled the possibility of deeply 
sequencing a whole genome, or even specific genes of interest. Custom panels for target- 
ing genes of interest can be designed and for that, there are many currently available kits 
in the market. HaloPlex (Agilent), SureSelect (Agilent), KAPA (Roche), Lotus (IDT), 
and AmpliSeq (Illumina) are some of the well-known kits for customized library prep- 
aration available. Moreover, SureSelect offers the possibility of creating a customized 
panel optimized for sequencing on the Ion Proton platform. Lastly, there is also the pos- 
sibility of amplifying the DNA by in-house multiplex PCR methods and using Nextera 
(Illumina) for library preparation. Thus, new markers that may not be present in array 
chips can be analyzed without the need for imputation. Haplotype analysis is also simpli- 
fied since NGS can provide phased sequence reads. In addition, as commented before, 
since thousands of markers can be analyzed in a single run, predictive SNPs can be easily 
incorporated into library preparation kits as soon as they are identified. Lastly, different 
types of markers can be analyzed in the same run. The ForenSeq DNA Signature Prep 
Kit (Verogen) panel, for example, includes 59 STRs and 172 SNPs, which allows the 
STR profiling and AT/BGA prediction to be performed simultaneously if necessary, 
saving the amount of DNA required. 
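
The polygenic score approach mentioned above can be sketched as a weighted sum of effect-allele counts. The marker names, weights, and genotypes below are purely hypothetical, and the choice to skip markers without a genotype call (while reporting how many were used) is an assumption of this sketch rather than a feature of any published tool.

```python
def polygenic_score(genotypes, weights):
    """Sum of per-allele weights times effect-allele counts (0, 1, or 2).

    Markers without a genotype call are skipped; the number of markers
    actually used is returned so that incomplete profiles can be flagged.
    """
    score = 0.0
    used = 0
    for marker, weight in weights.items():
        dosage = genotypes.get(marker)
        if dosage is None:
            continue  # marker not genotyped (e.g., absent from the panel)
        score += weight * dosage
        used += 1
    return score, used


# Hypothetical per-allele effect sizes and one individual's genotype calls
weights = {"snp_a": 0.5, "snp_b": -0.25, "snp_c": 1.0}
genotypes = {"snp_a": 2, "snp_b": 1}  # snp_c is missing from this profile

score, used = polygenic_score(genotypes, weights)
print(f"score = {score}, markers used = {used} of {len(weights)}")
```

The count of markers used makes explicit the limitation discussed in the text: a variant that is absent from the panel simply cannot contribute to the score.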

It should be mentioned that long-read massively parallel sequencing platforms, such as 
PacBio and Oxford Nanopore, also known as third-generation sequencers, have been 
available since 2010. However, their major novelty, i.e., the production of reads that usually
exceed 10,000 bp, may result in difficulties with handling degraded and/or low copy
number samples. Therefore, second-generation sequencers remain the best option for
FDP purposes aiming at forensic casework. 


FDP for forensic anthropology purposes 


The robustness of NGS technology in dealing with degraded samples makes it a powerful 
tool for analyzing samples from ancient and contemporary skeletal remains, which in- 
cludes the possibility of predicting EVCs from these remains. Kukla-Bartoszek et al. 
sequenced 63 bone samples at postmortem intervals up to almost 80 years using an 
Ion AmpliSeq HIrisPlex-S panel on the Ion Proton platform. Complete profiles were
obtained from 49 pg of DNA. The 41 markers were successfully genotyped in 35 samples
(55.6%), while 23 samples presented partial profiles. Although it was not possible to pre- 
dict skin color for any of the 23 samples with partial profiles, eye and hair color were pre- 
dicted for 20 and 12 of them, respectively (Kukla-Bartoszek et al., 2020). 

Another study analyzed two skeletal remains from La Belle, a French ship that sank in
1686 in what is now the USA. Although human remains were recovered in 1996, they were only
analyzed in 2015. Samples were prepared with ForenSeq DNA Signature Prep Kit 
(Verogen) and sequenced on MiSeq FGx Desktop Sequencer (Verogen). Partial profiles 
were generated for both samples: 216 (93.5%) and 154 (66.7%) markers were successfully 
called for the two samples. Eye and hair color predictions were made using HIrisPlex, and 
ancestry was also inferred. Since the ship was European, it was surprising that the first 
sample had predictions suggesting Native American ancestry, dark hair and brown 
eyes. The authors hypothesized that these remains could belong to a local Karankawa In- 
dian who was scavenging the shipwreck, became trapped, and perished. The second sam- 
ple, as expected, had a European origin. The pigmentation trait was not predicted due to 
insufficient data (Ambers et al., 2020). 

In the anthropology field, appearance information of ancient samples may help 
understand archeological and evolutionary aspects involved in the emergence of phenotypes.
Using whole-genome sequencing data, HIrisPlex-S and Snipper were employed to 
predict pigmentation phenotypes of 20 contemporary and 22 ancient Native American 
samples. Predictions were performed for 7 of 22 ancient samples and for all contemporary 
ones. Outcomes for both groups of Native Americans were consistent. Ancient individ- 
uals were predicted to have intermediate/brown eyes, black hair, and intermediate/ 
darker skin pigmentation (Carratto et al., 2020). 


Final remarks 


The interest in learning or applying predictive markers for forensic casework has been 
growing over the years. A survey conducted in 105 European forensic laboratories
has shown that 26.7% of them have experience with ancestry SNPs analysis; 21.9% 
have already done appearance SNPs analysis, and only 14.3% have performed methyl- 
ation analysis. However, 40% of the participants are interested in learning more about 
methylation; 37.1% desire to do appearance analysis, and 32.4% plan to perform ancestry 
inference. Between 15% and 21% of the participants said that they have not used predic- 
tive markers for analysis because it is not legally allowed in their respective country of 
residence (Gross et al., 2021). It is important to note that, along with the reliability 
and improvement of FDP tools, ethical and legal aspects grounding the approval of 
this technique for forensic casework should be carefully and extensively discussed. 
Although this is beyond the scope of this book, these questions are reviewed elsewhere (Nogel
et al., 2019; Samuel & Prainsack, 2019; Toom et al., 2016). Most countries do not have 
explicit legislation for FDP use but, at the same time, they do not have an implicit prohibition
either, since the development of this technique is still recent. In Europe, the first
countries to explicitly regulate FDP use were The Netherlands and Slovakia. In Austria, 
Germany, and Sweden, the technique is allowed, while in countries such as France, 
Poland, Spain, the United Kingdom, Australia, South Africa, and Brazil it is neither explicitly
allowed nor implicitly forbidden. In the United States, legislation varies from state to state
(Samuel & Prainsack, 2018, 2020).

As mentioned at the beginning of this chapter, NGS technology is starting to become 
accessible in many forensic laboratories, at least in Europe. We foresee it as a global trend. 
The advances in NGS technology are allowing the enhancement of the FDP area, not 
only technically, but also pushing more countries to start discussing its legal regulations 
for forensic casework. 


Introduction


Environmental factors such as lifestyle, diet, age, and stress have a significant effect on 
the human phenotype, which may not present itself as a change in the genome but is 
expressed as a modification to the epigenome (Boyce et al., 2020; Park et al., 2012; 
Tammen et al., 2013). The epigenome is defined as the set of chemical modifications 
to the DNA and DNA-associated proteins in the cell that alter gene expression and 
are heritable (https://www.genome.gov/about-genomics/fact-sheets/Epigenomics-Fact-Sheet).
Regulation of gene expression is facilitated by differential epigenetic profiles
displayed in different tissues and organs (Anastasiadi et al., 2018). Epigenetic changes 
include DNA methylation, nucleosomal remodeling, histone modification, chromatin 
looping, and noncoding RNAs. DNA methylation is the best-characterized epigenetic 
mark (Aristizabal et al., 2020; Goldberg et al., 2007; Mazzone et al., 2019). 

DNA methylation involves the covalent attachment of a methyl group to the fifth
carbon of the cytosine residue at CpG dinucleotides by the action of DNA methyltransferases.
However, non-CpG DNA methylation has also been reported (Jang et al., 2017;
Ramsahoye et al., 2000). Identification of differentially methylated regions (DMRs) such 
as tissue-specific DMRs (tDMRs), age-specific DMRs (Rakyan et al., 2011), population-specific
DMRs (Hernando-Herraez et al., 2015; Heyn et al., 2013) and differentially methylated
sites (DMSs) has fascinated the field of forensics with its potential applications. For almost 
a decade, DMRs and DMSs have been used as candidates for biomarkers in forensic 
research (Kader, Ghai, & Olaniran, 2020; Kader & Ghai, 2015). 

Several DNA methylation-based assays have been developed for monozygotic twin 
differentiation (Romanos & Borjac, 2021; Vidaki, Diez Lopez, et al., 2017), body fluid 
identification (Forat et al., 2016; Ghai et al., 2020; Lee et al., 2016), and age estimation 
(Correia Dias et al., 2020; Weidner et al., 2014). As additional epigenetic methods are 
developed and validated, the need for parallel high-throughput analysis becomes impor- 
tant. The use of massively parallel sequencing (MPS) is well-established in clinical epige- 
netics and is emerging as a new technology in the forensic field (Alvarez-Cubero et al., 
2017; Ballard et al., 2020). MPS, also known as next-generation sequencing (NGS), is an
alternative technology to conventional capillary electrophoresis (CE), as many forensic 
markers can be analyzed to the sequence level in a single assay (Tytgat et al., 2022). 
MPS offers the ability to significantly increase the discriminatory power of human iden- 
tification as well as aid in mixture deconvolution (Bennett et al., 2019). 

MPS has been successfully employed for forensic SNP and STR sequencing, missing 
person identification, disaster victim identification, and mitochondrial DNA analysis. See 
Ballard et al. (2020) for a complete review of MPS in forensics. The MiSeq FGx 
Sequencing System is the first and only NGS instrument developed and validated for 
forensic genomics (Jager et al., 2017). Though NGS techniques like pyrosequencing
and Illumina sequencing have been developed for epigenetic markers, no such method has been
validated for real-world casework yet.

The present chapter details the progress made regarding the utilization of NGS 
methods for epigenetic markers in forensics and the way forward for their routine imple- 
mentation in accredited forensic laboratories. The focus is on DNA methylation markers 
for body fluid identification and age estimation. 


Emerging MPS methods for forensics 


The MPS platforms currently used for forensic applications, which can integrate epige- 
netic markers, are briefed below. 


Illumina NovaSeq 


To solve a cold case from 1976, the headless body of a pregnant woman was exhumed in 
2007. Attempts were made using dental records, facial reconstruction, and fingerprinting 
to obtain critical information about the victim’s identity. Finally, in November 2020, the 
DNA extract from the body was sent to Othram Inc. (https://othram.com/) for analysis. 
Illumina’s NovaSeq was employed to construct a genealogical profile of the victim. Inci- 
dentally, the profile partially matched a DNA profile submitted by the victim’s nephew 
to a DNA database. As a result, information about the victim was obtained from the 
nephew, which resulted in the arrest of the father of the victim’s unborn child. 

Othram Inc. (Othram.com) is the first private laboratory that utilized MPS to solve 
forensic cases. It has been credited for working with samples that have been regarded 
as unsuitable for analysis by conventional forensic technologies. There have been several 
reports of successful crime-solving by Othram Inc., which can be found at: https:// 
www.solvedmysteries.com/ 


Illumina’s MiSeq FGx sequencing system 


The MiSeq FGx sequencing system allows the preparation and sequencing of libraries 
along with data analysis on one platform (Verogen.com, 2021). The MiSeq FGx 
System is capable of analyzing degraded DNA, low-quantity DNA, complex DNA 
mixtures, and other challenging samples commonly encountered at a crime scene. 
Additionally, one sequencing run interrogates hundreds of forensically relevant ge- 
netic markers, circumventing the need to choose between fragment-length-based 
STR kits. 


Oxford Nanopore Technologies MinION device 

The Oxford Nanopore Technologies (ONT) MinION device is the newest and smallest 
NGS platform available, which has been used for STR and SNP profiling. One drawback 
of nanopore sequencing is its relatively high error rate. However, a recent study has 
reported accurate autosomal STR profiles from long-read sequencing data using a 
streamlined method called STRspy, a tool that is capable of producing length and 
sequence-based STR allele designations from noisy, error-prone third-generation 
sequencing reads. Additionally, the method allowed the detection of SNPs in the flank- 
ing regions with >90% accuracy (Hall et al., 2022). The MinION device can enable on- 
site analysis, which could be a boon for sensitive cases and disaster victim identification 
(Tytgat et al., 2022). 


Ion Torrent PGM and Ion S5 


The Precision ID system coupled with the Ion Torrent MPS workflow has been 
successfully implemented for mitochondrial DNA (mtDNA) sequencing analysis 
by the Missing Persons DNA Program (Cuenca et al., 2020). Ion Chef is significantly 
superior to the Sanger sequencing workflow as it saves time and is user-friendly. Additionally, 
the Precision ID MPS workflow is capable of sequencing challenging samples, 
particularly degraded ones that are often encountered in the Missing Persons 
DNA Program. 

However, for markers other than STRs and SNPs, the Ion Torrent MPS platform still requires 
further improvement. In a collaborative exercise among 17 EUROFORGEN and 
EDNAP laboratories, the use of MiSeq/FGx and Ion Torrent PGM/S5 platforms was 
optimized for the identification of body fluids using mRNA markers. The PGM/S5 
workflow seemed to be less reliable with low input samples as compared to the MiSeq 
FGx, especially for blood, semen, or skin samples. Additionally, the PGM/S5 workflow 
(except for protocols including the Ion Chef robot) was reported to be more time- 
consuming and require more hands-on experience (e.g., chip loading) than the MiSeq 
FGx workflow (Ingold et al., 2018). 

The present chapter focuses on DNA methylation-based MPS assays developed for 
body fluid identification and age estimation. Though noncoding RNA (miRNAs and 
Piwi RNAs) markers have also been identified for forensic applications, they are discussed 
in detail in another chapter of this book. A detailed review of noncoding RNAs in fo- 
rensics is published by Haas et al. (2021). 


Body fluid identification by MPS 


The tissue specificity of DNA methylation has been identified as an attractive target for 
forensic applications. The benefit of DNA methylation-based methods is that the DNA 
isolated for STR profiling can be used for epigenetic analysis as well, although low-input 
protocols might be challenging to work with as bisulfite conversion degrades DNA (Peat 
& Smallwood, 2018) and forensic samples are also mostly poor in quality and quantity. 

Pyrosequencing, being the first NGS technology developed, has been extensively 
used for forensic body fluid/tissue identification by DNA methylation analysis. Though 
clone-based bisulfite-sequencing efficiently detects methylation differences in single al- 
leles (Huang et al., 2013), bisulfite pyrosequencing quantifies methylation levels and reli- 
ably detects differential methylation in cell populations without using the lengthy process 
of cloning bisulfite-treated DNA into bacterial expression vectors (Reed et al., 2010; 
Vidaki et al., 2016). However, to the best of my knowledge, to date, none of the pyro- 
sequencing methods have been validated for use by accredited forensic laboratories. 

The first study that employed pyrosequencing for body fluid/tissue identification was 
conducted by Madi et al. (2012). Gene-specific tDMRs were identified for blood, semen, 
and saliva. The target CpG sites at the FGF7 gene (8 CpG sites) displayed semen-specific 
hypermethylation, while 5 CpG sites at ZC3H12D showed semen-specific hypomethy- 
lation. The C20orf117 locus differentiated blood from sperm, saliva, and epithelial cells, 
and the BCAS4 locus was identified as a saliva hypermethylation marker (Table 17.1). 

In another study by the same research group, additional body tissues and loci were 
incorporated to identify spermatozoa-specific methylation markers by pyrosequencing. 
Blood, saliva, semen, and epithelial cells were used by Madi et al. (2012), whereas buccal 
cells, skin epidermis, and vaginal epithelial cells were added on by Balamurugan et al. 
(2014). Tissue-specific differential methylation was analyzed at the following loci: 
DACT1, USP49, DDX4, Hs_INSL6_03, Hs_ZC3H12D_05, and B_SPTB_03 by 
pyrosequencing. The markers displayed hypomethylation in semen and hypermethyla- 
tion in blood, buccal cells, skin epidermis, and vaginal epithelial cells, except for one 
marker, B_SPTB_03, which displayed hypermethylation in semen as compared to the 
other tissues. The remaining three markers, C20orf117, BCAS4, and ZC3H12D, pro- 
duced the same results as Madi et al. (2012). However, FGF7 did not reproduce previous 
results (Table 17.1). 

Though pyrosequencing has been the method of choice for methylation-based body 
fluid identification, Bartling et al. (2014) applied Illumina MiSeq to identify semen, 
saliva, skin epidermis, and blood pooled in a multiplex assay. A set of eight loci, namely, 
L91762, L68346, L50468, L14432, L4648, L62086, L76138, and L36599, were selected 
from the targets used by Frumkin et al. (2011) in a previous study. A k-NN algorithm was 
employed to assign samples to the tissue source groups based on the normalized 
methylation levels. The study marked the progression of epigenetic identification of body fluids 
from the conventional CE-based method to second-generation NGS. Additionally, this 
method could be combined with STR typing using NGS. In a similar study, Gauthier 
et al. (2019) implemented agglomerative hierarchical cluster analysis to create a model 
to indicate sample origin. A total of 5 CpG sites were selected at the loci BCAS4 
(Madi et al., 2012), cg06379435 (Park et al., 2014; Silva et al., 2016), VE_8, and 
ZC3H12D (Madi et al., 2012) to identify saliva, blood, vaginal epithelia, and semen, 
respectively. 

Table 17.1 Summary of studies reported for forensically relevant body fluid identification using MPS 
methods as discussed in the text. 

NGS method | Sample size/tissues/body fluids included in the study | Markers identified for tissue/fluid | References
Pyrosequencing | n = 10 (blood), n = 11 (saliva), n = 10 (sperm), n = 11 (epithelial cells) | Blood, saliva, and sperm | Madi et al. (2012)
Pyrosequencing | n = 10–11 (blood), n = 10–11 (buccal cells), n = 11–21 (skin epidermal cells), n = 9–11 (semen), n = 10 (vaginal epithelial cells) | Semen | Balamurugan et al. (2014)
Illumina MiSeq | n = 4 each of saliva, venous blood, skin, and semen | Semen, saliva, skin epidermis, and blood | Bartling et al. (2014)
Pyrosequencing | n = 100 consisting of whole blood, saliva, buccal cells, seminal fluid, vaginal fluid, menstrual secretion, skin, and urine | Semen | Vidaki et al. (2016)
Pyrosequencing | n = 8–12 per body fluid: blood, buccal, vaginal swabs, and semen | Vaginal epithelia | Antunes et al. (2016)
Pyrosequencing | n = 80, 20 each of blood, saliva, and vaginal secretions | Blood, saliva, semen, and vaginal secretions | Park et al. (2014)
Pyrosequencing | n = 23 (venous blood), n = 24 (buccal swabs), n = 22 (vaginal secretions), n = 20 (semen containing sperm), n = 5 (saliva), n = 3 (skin/sweat), n = 3 (semen without sperm), n = 3 (menstrual blood) | Semen, saliva, and blood | Alghanim et al. (2020)
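
The k-NN assignment reported for the Bartling et al. (2014) assay can be illustrated with a minimal Python sketch. The reference profiles, the three-locus layout, the value of k, and the use of Euclidean distance are all illustrative assumptions; none of them are values or design choices taken from the study.

```python
from collections import Counter
import math

# Hypothetical reference profiles: normalized methylation levels (0-1) at
# three made-up loci, each labeled with its tissue of origin.
# These numbers are illustrative only and are not published data.
REFERENCE = [
    ((0.90, 0.10, 0.20), "semen"),
    ((0.85, 0.15, 0.25), "semen"),
    ((0.10, 0.80, 0.30), "saliva"),
    ((0.15, 0.75, 0.35), "saliva"),
    ((0.20, 0.20, 0.90), "blood"),
    ((0.25, 0.15, 0.85), "blood"),
]


def knn_assign(sample, reference=REFERENCE, k=3):
    """Assign a sample to the majority tissue among its k nearest references."""
    # Sort reference profiles by Euclidean distance to the query sample.
    nearest = sorted(reference, key=lambda item: math.dist(sample, item[0]))[:k]
    # Majority vote over the tissue labels of the k closest profiles.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

In practice, the reference set would be built from samples of known origin, and k and the distance measure would be chosen during validation.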

Instead of relying on previously reported differentially methylated loci, most of which 
were in genic regions, Park et al. (2014) identified novel CpG sites by using the Illumina 
HumanMethylation450 BeadChip array, which includes over 450,000 CpG sites. 
From a total of 2986 body fluid-specific tDMSs, a final eight tDMSs were selected 
and evaluated by pyrosequencing on DNA samples from blood, saliva, and vaginal fluid. 
The public dataset collected from GEO was used to obtain information for semen 
markers. The eight candidate tDMSs were strongly hypermethylated in the target 
body fluid compared to the other body fluids, which is preferable to markers that show 
only relative differential methylation. The following CpG sites were identified as body 
fluid-specific: cg23521140 and cg17610929 (semen); cg06379435 and cg08792630 (blood); 
cg26107890 and cg20691722 (saliva); and cg0177489 and cg14991487 (vaginal fluid). In 
addition to the single CpG sites discovered, the authors analyzed neighboring CpG sites, 
which also displayed a hypermethylation profile similar to that of the target CpG site. 
The blood-specific CpG sites 
associated with age were removed to eliminate age-based bias in the identification of 
blood. 
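
The idea of preferring markers that are strongly hypermethylated in one body fluid and low in all others can be sketched as a simple screening rule. This is only an illustration: the function name, the two thresholds, and the example methylation values are hypothetical and are not the selection criteria used by Park et al. (2014).

```python
def is_target_specific(levels, target, min_target=0.7, max_other=0.2):
    """Return True if a CpG is hypermethylated in `target` and low elsewhere.

    `levels` maps a body fluid name to its mean methylation fraction (0-1).
    The thresholds are hypothetical placeholders, not published cut-offs.
    """
    others = [value for fluid, value in levels.items() if fluid != target]
    return levels[target] >= min_target and all(v <= max_other for v in others)


# Hypothetical mean methylation fractions at one CpG site.
example = {"blood": 0.85, "saliva": 0.10, "semen": 0.05, "vaginal fluid": 0.12}
blood_specific = is_target_specific(example, "blood")
```

A marker with only relative differential methylation, for instance 0.85 in blood but 0.40 in saliva, would fail this rule.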

A combination of pyrosequencing and quantitative PCR/high-resolution melt assays 
was developed by Alghanim et al. (2020) for the identification of semen with sperm, 
semen without sperm, buccal swabs, saliva (oral fluids), venous blood, menstrual blood, 
vaginal secretions, and sweat/skin samples. Two novel semen-specific markers (NMUR2 
and UBE2U) were shown to be effective in discriminating seminal stains containing 
sperm from other tissues, whereas one saliva (SA-6) and one blood (AHRR) marker 
were useful in identifying saliva and blood, respectively. 

To identify vaginal epithelium, Antunes et al. (2016) identified a tDMR (9 CpG sites) 
in the PFN3 gene. Even though a significant difference in methylation was observed in 
vaginal epithelia when compared to blood, saliva, and semen, no clear hypo- or 
hypermethylation was reported. 

To include epigenetic markers in routine forensic work, testing their sensitivity and 
specificity is essential. This necessitates proper developmental validation, 
similar to what is done for every STR marker or kit. To this effect, Silva et al. (2016) 
conducted developmental validation for three methylation-based markers specific for semen 
(ZC3H12D), saliva (BCAS4), and blood (cg06379435). The study revealed that the 
markers were primate-specific, and the sensitivity of detection varied from 0.1 to 
10 ng of starting DNA. Additionally, the markers provided a consistent 
methylation profile for degraded and inhibited samples. Mixed body fluid samples 
showed varying methylation levels, and the presence of mixtures could be detected. 
However, complete mixture deconvolution was not achieved. Similarly, validation of 
two semen-specific markers (cg04382920 and cg11768416) was carried out by Vidaki 
et al. (2016) to test the sensitivity and efficacy of the markers to identify aged semen sam- 
ples. The pyrosequencing assay could successfully identify semen in both fresh and old 
semen samples (Table 17.1). 

A number of DNA methylation markers have now been identified for forensically 
relevant body fluid identification. Hence, the next step should be to amalgamate all 
the markers and use MPS in a defined and validated workflow. An extensive collabora- 
tive exercise should be undertaken to assess the reproducibility of the results of methyl- 
ation markers in different laboratory settings before they can be incorporated into the 
routine forensic workflow. One such exercise was performed by Jung et al. (2016), 
in which seven laboratories carried out a multiplex methylation SNaPshot reaction 
composed of seven CpG sites as markers for the identification of four body fluids: 
blood, saliva, semen, and vaginal fluid. Across the participating laboratories, four different 
kits for DNA extraction, five different quantification methods, two different kits for bisulfite 
conversion, two different PCR buffers for multiplex PCR, three different thermocyclers, 
three different genetic analyzers, three different analysis software packages, and five 
different analytical settings were used. 

Performance was compared between laboratories at four 
different steps, namely: (1) CE of a purified single-base extension reaction product; 
(2) multiplex PCR of bisulfite-modified DNA; (3) bisulfite conversion of genomic 
DNA; and (4) extraction of genomic DNA from body fluid samples. The results 
revealed that 6 out of 7 laboratories produced consistent results even though variations 
due to types of kits and instruments existed. In a recent follow-up study by Lee et al. 
(2021), a technical validation was conducted where a methylation SNaPshot assay was 
employed for the identification of body fluids and estimation of age at 12 different lab- 
oratories. Regarding body fluid identification, in spite of differences in the amount of 
bisulfite DNA used and the threshold level for a positive methylation signal, comparable 
results were obtained by all of the 12 laboratories, and the correct body fluid (semen, 
blood, and saliva) was successfully identified. The authors noted that the input volume 
of bisulfite-converted DNA used for PCR amplification appeared to affect the success 
of the analysis, and the PCR buffer composition affected the measurement of DNA 
methylation. 

A similar exercise was undertaken by Ingold et al. (2018), wherein a total of 33 
mRNA body fluid markers for blood (6), semen (6), saliva (6), vaginal fluid (4), men- 
strual blood (5), and skin (6) were employed and tested on the MiSeq platform, and 29 
markers for blood (4), semen (6), saliva (6), vaginal fluid (4), menstrual blood (4), and 
skin (5) were analyzed by the Ion Torrent platform. A total of 17 laboratories participated 
in the exercise; 10 used the MiSeq/FGx platform and 9 used the PGM/S5 platform 
(2 laboratories did the experiments on both platforms). Furthermore, epigenetic 
markers should also be tested in combination with other genotyping markers on an 
MPS platform to assess how they complement each other. Such a collaborative exercise 
was conducted by Ingold et al. (2020), wherein 35 RNA-cSNPs and 33 mRNA 
markers were tested in two different multiplex assays on MiSeq/FGx platforms in 
nine laboratories. 


Age estimation by MPS 


DNA methylation patterns change with increasing age and contribute to age-related dis- 
eases (Bocklandt et al., 2011). Epigenetic clocks use predictable changes in the epige- 
nome, usually DNA methylation, to estimate chronological age with unprecedented 
accuracy (Ryan, 2021). Several articles have reported age predictors based on DNA 
methylation levels in specific tissues, for example, saliva or blood (Bocklandt et al., 
2011; Hannum et al., 2013); however, age estimation using DNA methylation has 
attracted the forensic community since 2013, when Horvath (2013) reported an epigenetic 
clock that could be used as a multi-tissue age predictor. He identified 
and characterized a total of 353 CpG sites that together form an aging clock. 

To develop an epigenetic age prediction model, two sample groups are required: a 
training group and an evaluation group. A training group is used for the initial building 
of the model, and an evaluation or validation group is used to evaluate the prediction’s 
performance. Both pyrosequencing and Illumina MiSeq have taken center stage for 
forensic age estimation. The studies below detail the evolution of DNA methylation- 
based age prediction. 
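
The training/validation scheme described above can be sketched in a few lines of Python. The example fits an ordinary least squares line of age on the methylation fraction at a single CpG using the training group, then reports the mean absolute deviation (MAD) on the validation group. The data values and function names are hypothetical, and a single-CpG linear fit is only a stand-in: published models typically combine several CpG sites and may use other regression forms.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_absolute_deviation(predicted, actual):
    """Mean absolute difference between predicted and chronological age (years)."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical data: methylation fraction at one age-associated CpG, and age.
train_meth = [0.20, 0.30, 0.40, 0.50, 0.60]
train_age = [10, 25, 40, 55, 70]
valid_meth = [0.25, 0.45, 0.55]
valid_age = [20, 45, 65]

# Build the model on the training group only.
intercept, slope = fit_line(train_meth, train_age)

# Evaluate on the independent validation group.
valid_pred = [intercept + slope * m for m in valid_meth]
valid_mad = mean_absolute_deviation(valid_pred, valid_age)
```

Keeping the validation group out of the fitting step is what makes the reported MAD an estimate of performance on new samples.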

Before NGS was used for age prediction, Garagnani et al. (2012) employed an Illumina 
Infinium HumanMethylation450 BeadChip to identify age-associated CpG sites in the 
genes ELOVL2, FHL2, and PENK from the whole blood of 32 mother-offspring pairs 
aged 42–83 and 9–52 years, respectively. The CpG islands of ELOVL2, FHL2, and 
PENK in the gene promoters were found to be hypermethylated in mothers compared 
to the offspring. The results were also replicated in 494 individuals (245 men and 249 
women) ranging in age from 9 to 99 years by Sequenom’s EpiTYPER assay. 

Later, Weidner et al. (2014) identified three age-related CpGs located in the genes 
ITGA2B, ASPA, and PDE4C by bisulfite pyrosequencing of 151 blood samples that 
could robustly predict age with a mean absolute deviation (MAD) from chronological 
age of less than 5 years. This approach requires only locus-specific pyrosequencing and is 
cost-effective. The study also revealed that patients with acquired aplastic anemia or dys- 
keratosis congenita were predicted to be prematurely aged due to progressive bone 
marrow failure and severe telomere attrition (Table 17.2). 


Table 17.2 Summary of studies reported for age prediction by MPS methods as discussed in the text. 

NGS method | Sample size/training set (T) + validation set (V) for age prediction model development (a) | Target tissue/body fluid | Age range | References
Pyrosequencing | 82 (T) + 69 (V) | Blood | 0–78 years | Weidner et al. (2014)
Pyrosequencing | n = 44 (saliva), n = 23 (blood) | Saliva, blood | 5–72 years | Soares Bispo Santos Silva et al. (2015)
Pyrosequencing | n = 169 deceased, n = 37 living; n = 29 healthy erupted third molars | Blood | 0–91 years; dental samples: 19–70 years | Bekaert et al. (2015b)
Pyrosequencing | 303 (T) + 124 (V) | Blood | 2–75 years | Zbieć-Piekarska, Spólnicka, Kupiec, Makowska, et al. (2015)
Pyrosequencing | 300 (T) + 120 (V) | Blood | 2–75 years | Zbieć-Piekarska, Spólnicka, Kupiec, Parys-Proszek, et al. (2015)
Pyrosequencing | 535 (T) + 230 (V) | Blood | 11–90 years | Park et al. (2016)
Pyrosequencing | Saliva: 52 (T) + 39 (V); blood: 40 (T) + 32 (V) | Blood, saliva | 9–73 years; blood: 5–72 years | Alghanim et al. (2017)
Illumina MiSeq | n = 1156 (blood); n = 265 (saliva); 159 (T) + 53 (V) + 53 (blind test); n = 106, blood samples from 53 monozygotic twin pairs; n = 1011, blood samples (577 females and 434 males) from diseased donors | Blood, saliva | Blood: 2–90 years; saliva: 21–55 years; monozygotic twins: 33–77 years; diseased samples: 17–91 years | Vidaki, Ballard, et al. (2017)
Table 17.2 Summary of studies reported for age prediction by MPS methods as discussed in the text (continued). 

NGS method | Sample size/training set (T) + validation set (V) for age prediction model development (a) | Target tissue/body fluid | Age range | References
Pyrosequencing | 100 (T) + 36 (V) | Postmortem blood | 18–60 years | Sukawutthiya et al. (2021)
Pyrosequencing and Illumina MiSeq | 84 | Blood | 18–99 years | Freire-Aradas et al. (2020)
MiSeq | 144 (29 each tissue/fluid) except n = 28 (buccal swabs) | Samples of brain, bone, muscle, buccal swabs, and whole blood | 0–87 years | Naue et al. (2018)
MiSeq/FGx | 31 | Plucked hair | 5–88 years | Naue et al. (2021)
MiSeq/FGx | n = 160 (blood), n = 160 (buccal swab), n = 161 (bone); 112 (T) + 48 (V) (blood and buccal cells) + 49 (V) (bones) | Blood, buccal cells, bone | Blood: 1–75 years; buccal cells: 2–80 years; bone: 19–93 years | Wozniak et al. (2021)
MiSeq | 90 exogenous deaths; 60 (T) + 30 (V) | Postmortem blood | 2 weeks–91 years | Guan et al. (2021)
Pyrosequencing | n = 73 deceased (68.5% male); n = 142 healthy living (41.5% male) | Buccal swabs in different stages of decomposition | Deceased: 0–90 years; healthy: 0–89 years | Koop et al. (2021)

(a) Number of samples used for age prediction model development are indicated in italics. 


Age-correlated hypermethylated CpG sites were reported by Soares Bispo Santos 
Silva et al. (2015) in the GRIA2 (3 CpG sites) and NPTX2 (6 CpG sites) genes. The 
MAD was 6.9 and 9.2 years for the GRIA2 and NPTX2 loci, respectively. Older individuals 
displayed increased methylation in both markers compared to younger individuals, and 
this trend was more pronounced in the GRIA2 locus when compared to NPTX2. 

As pyrosequencing-based age determination assays developed, blood samples from both living 
and deceased individuals were utilized by Bekaert et al. (2015a), and the results demonstrated 
that methylation of CpG sites in the ASPA and EDARADD genes was negatively corre- 
lated with age, whereas positive correlations were found between methylation and age 
for PDE4C and ELOVL2. The multivariate quadratic regression model predicted donor 
age with a mean error of 3.75 years. Additionally, the authors conducted an age estima- 
tion using dentin samples with a MAD of 4.86 years. 

In addition to the frequently used gene ELOVL2, Zbieć-Piekarska, Spólnicka, 
Kupiec, Parys-Proszek, et al. (2015) identified additional age-correlated CpG sites at the 
C1orf132, TRIM59, KLF14, and FHL2 loci. A training set of 300 individuals gave a 
MAD of 3.4 years, and a validation set of 120 individuals gave a MAD of 3.9 years. 
The age predictor was more accurate for young individuals than 
for older ones. A sex difference was also reported in the training set, 
with females showing a lower MAD (Table 17.2). 

DNA methylation is population-specific (Galanter et al., 2017; Kader, Ghai, & Zhou, 
2020). To study age prediction in the Asian population, Park et al. (2016) evaluated 
methylation levels at three age-correlated CpG sites, namely cg16867657 (ELOVL2), 
cg04208403 (ZNF423), and cg19283806 (CCDC102B), by pyrosequencing. A total 
of 535 samples were used to develop an age prediction model with a MAD of 3.156 
years, and the validation set consisted of 230 samples with a MAD of 3.346 years. Increased 
accuracy was observed for the <60-year-old age group, compared with 57.30% in the older group 
(≥60 years). 

Meanwhile, Guan et al. (2021) were interested in examining the reliability of epige- 
netic age markers in postmortem blood and specifically in the Japanese population. DNA 
methylation analysis was done using the Illumina MiSeq system to identify age-associated 
CpGs. Four CpG sites on four genes—ASPA, ELOVL2, ITGA2B, and PDE4C—were 
found to be highly correlated with chronological age. Whole blood samples were 
collected from 2-week-olds to 91-year-olds during autopsies, and it was ensured that 
the cause of death was extrinsic (due to strangulation and lethal trauma and not from 
any disease) and the postmortem interval for each subject was less than 10 days to avoid 
severe DNA fragmentation and contamination (Table 17.2). 

All epigenetic age models are generally based on regression-based prediction with a 
certain range of MAD. In order to increase the model accuracy and decrease error, 
Vidaki, Ballard, et al. (2017) utilized machine learning algorithms, specifically artificial 
neural networks (ANNs), combined with NGS (Illumina MiSeq) for age 
prediction. ANNs are a group of machine-learning algorithms inspired by biological systems. 
This was the first study that used machine learning, via ANNs, together with an 
NGS-based DNA methylation detection method for forensic age prediction. Both blood 
and saliva samples were used for the study. The age prediction model was validated by an 
independent cohort of 53 monozygotic twin pairs aged 33–77 years. A few twins 
were predicted to be either much older or much younger than their actual age, but 
the prediction differences within twin pairs were not statistically significant. The authors 
also investigated a set of diseased samples and their effect on age prediction. The mean 
absolute errors differed by disease as follows: type I diabetes, 8.63 years; 
anemia, 14.38 years; bone marrow disorders (including leukemia), 11.09 years; ovarian 
cancer, 7.45 years; breast cancer, 6.77 years; and schizophrenia, 5.03 years. These 
differences suggest that disease status could influence epigenetic age prediction. 

Additional age prediction models were developed using bisulfite pyrosequencing for 
forensic age estimation. Alghanim et al. (2017) targeted CpG sites within the genes 
SCGN, DLX5, and KLF14, whereas Sukawutthiya et al. (2021) used only 2 CpG sites in the 
gene ELOVL2 to obtain good age indicators in the age groups of 20–40 years and 
over 40 years. Decedent blood was used by Anaya et al. (2021) to predict chronological 
age by bisulfite pyrosequencing analysis of single CpG sites on five genes: KLF14, 
ELOVL2, C1orf132, TRIM59, and FHL2. It was observed that the prediction potential 
of the models decreased with the increasing age of the person, particularly for the age 
categories above 50 years. 

It is vital to compare the developed age prediction models in an independent valida- 
tion study using the same population to select the best one for forensic applications. 
Hence, Daunay et al. (2019) evaluated and compared six prediction models (Bekaert 
et al., 2015a; Park et al., 2016; Thong et al., 2017; Weidner et al., 2014; Zbieć-Piekarska, 
Spólnicka, Kupiec, Makowska, et al., 2015; Zbieć-Piekarska, Spólnicka, Kupiec, 
Parys-Proszek, et al., 2015) in an independent study using 100 blood samples from French 
individuals aged between 19 and 65 years. The Bekaert model presented the best overall 
performance and accuracy for age prediction (MAD of 4.5 and SEE of 6.8), followed 
by the Thong model (MAD of 5.2 and SEE of 7.2). However, for forensic purposes, 
the Zbieć-Piekarska 1 model was found to be most suitable as it allows quick and accurate 
age estimation even with low quantities of DNA, based on just 2 CpG sites located at a 
single locus. 

Four different technologies (EpiTYPER, pyrosequencing, MiSeq, and SNaPshot) were 
utilized for epigenetic age estimation of DNA from 84 blood samples from healthy 
Europeans ranging in age from 18 to 99 years (Freire-Aradas et al., 2020). DNA methylation 
was analyzed at 4 CpG sites in three genes, viz., ELOVL2, FHL2, and MIR29B2, which 
have been previously reported to be age-specific (Jung et al., 2019; Zbieć-Piekarska, 
Spólnicka, Kupiec, Parys-Proszek, et al., 2015). At the ELOVL2 and FHL2 loci, similar 
patterns of DNA methylation were observed for EpiTYPER, pyrosequencing, and MiSeq; 
consequently, data from these techniques can be used in platform-independent 
age prediction models. 

Several reports have been documented above on age prediction models based on age- 
correlated CpGs. However, to enable the routine application of methylation-based age 
prediction, standardization and reproducibility of these methods are necessary. Therefore, 
Pfeifer et al. (2020) tested two different published age prediction models for blood 
and mouth swab samples with regard to prediction accuracy (Bekaert et al., 2015a, 
2015b). 

A total of 151 whole blood samples (aged 1 day–96 years) of deceased individuals and 
149 mouth swabs (aged 13 days–95 years) of living individuals were collected. The blood 
samples came from 63 female and 88 male donors, and the mouth swabs from 60 female 
and 89 male donors. The 
mean time between death and sample collection was 6 days, with a minimum of less 
than 24 h and a maximum of 20 days. Additionally, freshly taken blood samples of 21 
individuals (20—65 years old) were collected from donors with disease states that included 
different cancers, mental and physical handicaps, different levels of diabetes, lung diseases 
(chronic bronchitis and asthma), schizophrenia, and coronary heart diseases. Moreover, 
drug and alcohol abuse could be documented for 20 deceased individuals. Results gener- 
ated revealed that the CpG site in the ELOVL2 gene was the most informative age- 
associated marker, which displayed hypermethylation with increasing age. CpG sites in 
genes ASPA, CCDC102B, C1orf132, ITGA2B, KLF14, and PDE4C displayed a mod- 
erate correlation with age. The postmortem period and diseases had no effect on DNA 
methylation (Table 17.2). 

The authors demonstrated that interlaboratory differences in sampling, DNA input, 
bisulfite conversion systems, PCR methods, and analysis methods 
greatly influence the quantification of DNA methylation. To mitigate the effect of these 
variables, the use of standards to normalize DNA methylation results for better 
comparison with other studies could be helpful. 

In addition to blood and saliva, MPS has been utilized to determine age using less 
commonly used tissues such as brain, bone, muscle, and buccal swabs (Naue et al., 2018; 
Wozniak et al., 2021), as well as hair (Naue et al., 2021). The hair was plucked from 49 
deceased individuals, both male and female (aged 5–88 years). Hair was from different phases 
of the hair cycle, i.e., the growing phase, transition phase, and resting phase, as given at 
the time of the individual’s death. Samples were sequenced on a MiSeq FGx (Verogen, 
San Diego, CA, USA). From a total of 49 samples, only 31 could be used for methylation 
determination, as 18 samples showed low sequencing coverage or bisulfite conversion ef- 
ficiency below 98%. This could be because of the use of hair, which often contains a low 
quantity of DNA or fragmented DNA. A high correlation between the DNAm and age 
was observed for the ELOVL2, KLF14, RPA2, TRIM59, and ZYG11A loci. 
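
The sample filtering described above can be sketched as follows. After bisulfite conversion, unmethylated cytosines are read as thymine, so the methylation level at a CpG is commonly computed as C reads divided by C plus T reads, and conversion efficiency is commonly estimated from non-CpG cytosines, which are assumed here to be unmethylated. The 98% efficiency threshold is the one mentioned in the text; the minimum coverage of 1000 reads, the function names, and the read counts are hypothetical and do not describe how Naue et al. (2021) implemented their analysis.

```python
def methylation_level(c_reads, t_reads):
    """Fraction of reads that retain C at a CpG after bisulfite conversion."""
    total = c_reads + t_reads
    if total == 0:
        raise ValueError("no reads cover this CpG")
    return c_reads / total


def conversion_efficiency(non_cpg_c_reads, non_cpg_t_reads):
    """Fraction of non-CpG cytosines read as T (assumed to be unmethylated)."""
    total = non_cpg_c_reads + non_cpg_t_reads
    if total == 0:
        raise ValueError("no reads cover non-CpG cytosines")
    return non_cpg_t_reads / total


def passes_qc(coverage, efficiency, min_coverage=1000, min_efficiency=0.98):
    """Keep a sample only if coverage and conversion efficiency are sufficient.

    min_coverage is a hypothetical placeholder; min_efficiency follows the
    98% threshold mentioned in the text.
    """
    return coverage >= min_coverage and efficiency >= min_efficiency
```

For example, a sample with 1500 reads and an efficiency of 0.99 would be retained, whereas the same coverage with an efficiency of 0.97 would be excluded.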

In a proof-of-principle study, Koop et al. (2021) used buccal swabs under different 
postmortem conditions, viz., no visible signs of decomposition, early decomposition, 
advanced decomposition, and severe decomposition, for epigenetic age estimation 
by pyrosequencing at the age-associated CpG-1 site of the PDE4C gene (upstream of 
cg17861230). Postmortem samples, except severely decomposed ones, exhibited 
a high correlation of methylation levels with age, independent of the state of decompo- 
sition, which was comparable to the correlation for samples from live donors. 

The gradual process by which epigenetic markers could be incorporated into the current
forensic workflow was envisioned by Heidegger et al. (2022), who presented the first
interlaboratory validation of a single targeted bisulfite MPS assay (Illumina MiSeq) for
forensic age prediction. The assay was validated by five consortium laboratories of the
VISible Attributes through GEnomics (VISAGE) project. Two panels of age-specific
markers were tested: one consisting of 13 markers (ET-13) and a second consisting of
5 markers (ET-5). The study examined the reproducibility, species specificity, and
sensitivity of producing DNA methylation profiles from semen samples. Both the 13- and
5-marker sets performed well. Methylation quantification was reproducible between
laboratories, although variability in the quantification results increased below 50 ng
of DNA, which affected the difference between technical duplicates.


Improvements in the application of epigenetic markers via MPS 


Though MPS allows multiplexing and analysis of highly degraded DNA samples, it is 
necessary to improve the analytical tools in order to obtain more reliable results for epige- 
netic markers. Currently, many analysis algorithms are not standardized (Bruijns et al., 
2018; Jordan & Mills, 2021). Additionally, for some target CpGs, a large interindividual 
variation in methylation levels has been observed (Vidaki et al., 2016), since DNA 
methylation patterns are dynamic and can be influenced by environmental factors. 
Hence, several confounding factors could affect identification (e.g., body fluids) and pre- 
dictions (e.g., age estimation) and must be taken into account (Jordan & Mills, 2021). 


Future considerations 


MPS or NGS methods have successfully been integrated into CE-based workflows for
forensic human identification. The next step is to set up a parallel platform for the
application of epigenetic markers in forensic work that can complement the existing
technology. The most essential prerequisite for any methylation-based marker is
bisulfite conversion of DNA, which severely degrades DNA quality. Therefore,
methylation detection methods need to be verified and evaluated to meet the
requirements of high sensitivity for low-template DNA samples. To improve
reproducibility among laboratories, a standardized workflow for forensic epigenetic
analyses is an important factor to be considered. Pyrosequencing for body fluid
identification and age estimation has been extensively verified and validated. In the
future, it would be worthwhile to multiplex markers for body fluid identification and
age estimation into a single reaction. Similar to the collaborative exercise conducted
by several labs for mRNA-based body fluid identification using MPS methods, forensic
labs should also collaborate to test epigenetic markers across laboratories to evaluate
the MPS application in an extended group.

Age-prediction models should include data from standards with known DNA methylation
values for every pyrosequencing assay, so that DNA methylation data can be normalized
and compared between studies. Furthermore, DNA methylation differences between
ethnicities/ancestries should be identified, and markers unaffected by population
differences should be developed to ensure accurate age estimation. Moreover, the
different age prediction models should be systematically evaluated on the same
population in order to compare their performance and identify the model with the best
age prediction accuracy. Validation studies are essential to understanding technical
assay variability.
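
A systematic comparison of age prediction models on the same population can be as simple as scoring every model against the same known ages with a common error metric. The sketch below uses the mean absolute error (MAE) as that metric, which is an assumption made here for illustration. The model names and all ages are hypothetical.

```python
def mean_absolute_error(true_ages, predicted_ages):
    """Average absolute difference between known and predicted ages (years)."""
    if not true_ages or len(true_ages) != len(predicted_ages):
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(t - p) for t, p in zip(true_ages, predicted_ages))
    return total / len(true_ages)


def rank_models(true_ages, predictions_by_model):
    """Score every model on the same individuals; lowest error first."""
    scores = {
        name: mean_absolute_error(true_ages, preds)
        for name, preds in predictions_by_model.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical chronological ages and predictions from two hypothetical models.
true_ages = [20, 35, 50, 65]
predictions_by_model = {
    "model_A": [24, 31, 55, 60],
    "model_B": [21, 37, 48, 66],
}
ranking = rank_models(true_ages, predictions_by_model)
```

Because every model is scored on the same individuals, differences in error reflect the models rather than the cohorts they were originally trained or tested on.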

NGS allows improved resolution and simultaneous analysis of STRs, SNPs, and Y- and
X-chromosome markers. In the future, epigenetic methods such as DNA methylation
profiling at target CpGs will complement STRs and SNPs. A new approach using CpG-SNPs
for individual-specific body fluid identification was reported by Watanabe et al.
(2022). Even though this approach is still in its infancy, the use of CpG-SNPs could
be an effective method to indicate SNPs associated with body fluids in a mixture.
Additionally, DNA methylation assays could be used to infer the lifestyle of
individuals, such as smoking status (Alghanim et al., 2018).


Introduction


Forensic scientists have long used nucleic acids (DNA as well as RNA) to produce
conclusive evidence in cases of sexual assault, homicide, and all other cases where
biological evidence is recovered from the victim, suspect, or crime scene (Jawad et al.,
2020). With the development of the DNA profiling technique by Sir Alec Jeffreys in
1984, DNA polymorphism gained recognition as a tool for crime investigation. This
technique, based on the variable number of tandem repeats, helped solve a double
murder case (in Leicestershire, UK) by not only identifying the suspect but also
clearing an innocent person (Li, Norman, & Schober, 2015). This demonstrated the huge
potential of DNA profiling in forensic and criminal investigations. With the invention
of the polymerase chain reaction (PCR) by Kary Mullis, additional molecular techniques
were developed for DNA profiling. These techniques include amplified fragment length
polymorphism and short tandem repeat (STR) profiling, as well as the more recent single
nucleotide polymorphism (SNP), mitochondrial DNA, and Y-STR (Y chromosome-specific
STR) profiling.

The DNA profile can help in identifying a person from a biological sample, but it is
mostly unable to identify the source, nature, and type of the evidence material. This
information can help to clarify the accusation and support the link between the suspect
and the evidence material, and it can be reliably obtained using RNA testing (van den
Berge & Sijen, 2017). RNA testing makes use of different types of RNA, including mRNA,
tRNA, and rRNA, along with miRNA, antisense RNA, siRNA, and snRNA (Jawad et al., 2020).
The biological functions of these RNAs range from protein biosynthesis to
posttranscriptional gene regulation. MicroRNAs (miRNAs) are short (18–24 nucleotides
long), noncoding RNA molecules that play an important role in the posttranscriptional
regulation of gene expression. The first miRNA was discovered in the roundworm
Caenorhabditis elegans in 1993. Subsequent research has revealed the crucial regulatory
roles of miRNAs in gene expression in plants as well as animals (Lagos-Quintana et al.,
2001; Lee & Ambros, 2001). It is estimated that approximately 3% of human genes encode
miRNAs. Further, up to 60% of human genes are probable targets of these miRNAs (Git
et al., 2010). Data from miRBase, the bioinformatic database of miRNA sequences, show
that 1917 miRNAs are encoded by the human genome (Rocchi et al., 2021). The majority of
miRNAs are found inside the cell, but a few have been observed circulating in various
body fluids, including tears, cerebrospinal fluid, plasma, and saliva. Such
extracellular miRNAs are commonly referred to as circulating miRNAs (Sohel, 2016; Weber
et al., 2010). Circulating miRNAs are potential forensic biomarkers because they remain
highly stable even at high temperatures, at extreme pH, and under chemical treatment,
and because they are carried by lipoprotein complexes, which protect them from
endogenous RNase activity (Layne et al., 2019; Mayes et al., 2019). Advances in
molecular techniques such as PCR, microarrays, and high-resolution melt analysis have
helped forensic scientists to develop and validate newer RNA-based methods to solve
problems that were unsolvable using conventional methods (Jawad et al., 2020). Further,
next-generation sequencing (NGS) and single-cell sequencing have helped researchers to
develop reliable RNA-based methods for the identification of body fluids, estimation
of postmortem interval (PMI), identification of tissue type, determining the age of a
wound, distinguishing antemortem and postmortem drowning, gender determination,
menstrual blood identification, determining drug abuse, and identifying and
discriminating monozygotic twins (Rocchi et al., 2021). The following sections discuss
the aforementioned applications of miRNAs individually.


miRNA and body fluid identification 


Body fluids are one of the most common types of biological evidence found at crime
scenes, especially in cases of homicide, suicide, and sexual assault. They may include
blood, semen, saliva, urine, and vaginal secretions. Identifying the type of body fluid
retrieved from the crime scene is important to determine the nature of the crime,
reconstruct the offense, relate the suspect to the crime scene, and identify the
culprit. Therefore, the very first step in the forensic investigation is to determine
the type of body fluid, followed by DNA profiling to determine the identity of the
culprit (Li, Norman, & Schober, 2015). The current presumptive and confirmatory methods
for identifying body fluids exploit the biochemical properties of these fluids to
produce color, fluorescence, or particular compounds. These methods are rapid and show
varying levels of sensitivity and specificity. It is worth noting, however, that some
of these screening tests are destructive in nature (Harbison & Fleming, 2016; Rocchi
et al., 2021; Virkler & Lednev, 2009), which means that a part of the biological
evidence is consumed during testing. This is a matter of great concern because, most of
the time, only a small amount of biological evidence is retrieved in criminal cases,
and after the preliminary examination, the same evidence is used for DNA profiling.
Forensic scientists therefore cannot afford to lose part of the evidence, and it is
very important to devise methods that use a small amount of sample and provide
confirmatory results.

Alternative methods proposed for body fluid identification (BFID) were DNA
methylation profiling and mRNA analysis. However, DNA methylation has low specificity,
and mRNAs are susceptible to degradation due to cellular, extracellular, and
environmental factors (temperature, humidity, etc.) (An et al., 2012; Kader et al.,
2020; Lee et al., 2016; Rocchi et al., 2021). This led researchers to explore more
advanced methods for BFID. In the last decade, forensic researchers have proposed a
number of miRNA-based biomarkers for cell-type identification. The initial studies on
forensic applications of miRNAs focused on identifying miRNAs that are specific to a
body fluid, but most miRNAs lacked this specificity (Rocchi et al., 2021). Therefore,
comparative miRNA profiling was proposed for better identification of body fluids. The
small size of miRNAs and the possibility of coextracting them with DNA make miRNAs
ideal candidates for use in forensic casework. Two separate studies have reported
coanalysis of DNA profiling and BFID. Li, Zhang, et al. (2014) devised a four-miRNA
marker-based BFID method that could discriminate between semen, venous blood, and
menstrual blood. Along similar lines, Mayes et al. (2018) designed an eight-marker-based
system that could differentiate semen, saliva, venous blood, and menstrual blood. Both
of these methods yielded complete DNA STR profiles. miRNA profiling of individual body
fluids has also been performed by various researchers using PCR-based methods,
microarrays, and NGS. Different authors reported different miRNA biomarkers for BFID,
and some of the miRNAs overlap between different body fluids (Rocchi et al., 2021). A
collective list of the miRNAs reported in the literature for BFID is provided in
Table 18.1.

From this table, it can be observed that multiple researchers have reported miR-451a 
to be a strong biomarker for blood identification, and along with this, miR-144-3p and 
miR-16 also seem useful biomarkers for venous blood identification. miR-214-3p and 
miR-412 have higher expression levels in menstrual blood as compared to other body 
fluids. miR-203a and miR-205-5p have been suggested to be useful in saliva
identification. miR-891a, miR-10b, and miR-888 have been confirmed by different
researchers as semen-specific markers, and miR-203a has been suggested for vaginal
secretion identification.


Discrimination of monozygotic twins 


Forensic scientists regularly come across cases that involve the identification of
living persons, deceased persons, and compromised human remains. Human identification
can be scientifically achieved using fingerprint, dental, anthropological, genetic, and
radiological examinations (Nikam et al., 2015). With the advent of PCR, DNA
fingerprinting became the

Table 18.1 miRNA biomarkers reported in literature for BFID.


Body fluid 


Blood (venous) 


Saliva 


Semen 


Biomarker miRNA 


miR-144-3p 


miR-16 


miR-185 
miR-126 
miR-150 
miR-486 
miR-203a 
miR-943 
miR-200b 
miR-142-3p 
miR-205-5p 


miR-203a 


miR-658 
miR-142-3p 
miR-200c 
miR-26b 
miR-223 
miR-891a 


References 


Courts and Madea (2011), Fujimoto 
et al. (2019), Hanson et al. (2009), 
He et al. (2020a, 2020b), Li, 
Zhang, et al. (2014), Mayes et al. 
(2018), Omelia et al. (2013), Sirker 
et al. (2017) 

Fujimoto et al. (2019), He, Han, et al. 
(2020), Sauer et al. (2016), 
Zubakov et al. (2010) 

Fujimoto et al. (2019), Hanson et al. 
(2009), Wang, Zhang, et al. (2013) 

Zubakov et al. (2010) 

Courts and Madea (2011) 

Courts and Madea (2011) 

Wang, Zhang, et al. (2013) 

Sauer et al. (2016) 

He, Ji, et al. (2020) 

Seashols-Williams et al. (2016) 

Mayes et al. (2018) 

Courts and Madea (2011), Hanson 
et al. (2009), He, Han, et al. (2020), 
Mayes et al. (2018), Omelia et al. 
(2013) 

Courts and Madea (2011), Fujimoto 
et al. (2019), Hanson et al. (2009), 
Sauer et al. (2016) 

Hanson et al. (2009) 

Zubakov et al. (2010) 

Courts and Madea (2011) 

Seashols-Williams et al. (2016) 

Fujimoto et al. (2019) 

Belleannée et al. (2012), Fujimoto 
et al. (2019), He, Han, et al. (2020), 
Hu et al. (2014), Mayes et al. 
(2018), Sauer et al. (2016), 
Seashols-Williams et al. (2016),
Wang, Zhang, et al. (2013), 
Zubakov et al. (2010) 

Belleannée et al. (2012), Fujimoto 
et al. (2019), He, Han, et al. (2020), 
Hu et al. (2014), Li, Zhang, et al. 
(2014), Luo et al. (2015), Wang, 
Zhang, et al. (2013)


Table 18.1 miRNA biomarkers reported in literature for BFID (cont'd).


Body fluid Biomarker miRNA 
miR-10b 


miR-135b 
miR-135a 


miR-374 
miR-10a 
Menstrual blood miR-214-3p 
miR-412 
miR-205-5p 


miR-203a 


miR-144-3p 


miR-451a 
miR-943 
miR-1246 
miR-200b 
miR-141-3p 
Vaginal secretion miR-203a 


miR-124a 
miR-372 
miR-200c-3p 
miR-205-5p 
miR-654-5p 
miR-888 
miR-155-5p 
miR-1260b 


References 


Hanson et al. (2009), Mayes et al. 
(2018), Sirker et al. (2017) 

Hanson et al. (2009) 

Luo et al. (2015), Zubakov et al. 
(2010) 

Sirker et al. (2017) 

Fujimoto et al. (2019) 

He et al. (2020a, 2020b), Li, Zhang,
et al. (2014), Wang, Zhang, et al.
(2013)

Bexon and Williams (2015), Hanson
et al. (2009), Mayes et al. (2018)

Hanson et al. (2009), He, Ji, et al.
(2020)

He, Ji, et al. (2020), Wang et al.
(2015)

He, Han, et al. (2020), Sauer et al.
(2016)

Hanson et al. (2009) 

Sirker et al. (2017) 

Seashols-Williams et al. (2016) 

Seashols-Williams et al. (2016) 

Mayes et al. (2018) 

Sauer et al. (2016), Sirker et al. 
(2017), Wang et al. (2015) 

Hanson et al. (2009) 

Hanson et al. (2009) 

Wang et al. (2015) 

Wang et al. (2015) 

He, Han, et al. (2020) 

He, Han, et al. (2020) 

Fujimoto et al. (2019) 

Fujimoto et al. (2019) 


most recognized technique for the genetic identification of humans using biological
samples. To date, DNA fingerprinting is the most popular, recognized, and scientifically
accepted technique for discriminating between two persons from their biological samples
in cases involving sexual assault, paternity disputes, and any other case where
biological evidence is recovered. This is possible because the human genome is unique
to every person (Krawezka & Schmidtke, 2020). Despite this, DNA fingerprinting fails to
differentiate between monozygotic twins. Since monozygotic twins are formed from a
single zygote and are hence genetically identical (Budowle, 2014), their distinction
remains a major challenge. Therefore, in cases where one of the monozygotic twins is
related to the biological evidence, the other one cannot be excluded. This conundrum of
“which one committed the crime?” creates a difficult situation for the legal as well as
the forensic communities (Budowle, 2014). To overcome this problem, different methods
have been devised for differentiating monozygotic twins. These methods include copy
number variation, SNPs, the mitochondrial genome, and epigenetic-based methods like DNA
methylation, RNA expression level, and protein level (TianYue et al., 2020). To date,
however, no effective technique for monozygotic twin differentiation has been
established.

Some studies have reported miRNAs as potential molecular markers for monozygotic twin
identification. Perkins et al. (2007), Kim et al. (2010), and Sarachana et al. (2010),
in separate studies, investigated the expression of miRNAs in monozygotic twins to
study disease mechanisms and found that monozygotic twins may differ in miRNA
expression levels. Chen Fang and colleagues investigated the miRNA expression profile
in a pair of monozygotic twins and reported that 96 out of 509 miRNAs were very highly
expressed and showed differential expression between the twins (Fang et al., 2019).
Chao Xiao and colleagues studied the miRNA expression profiles of two pairs of
monozygotic twins. A total of 74 miRNAs were found to be differentially expressed in
the male monozygotic twins, while 220 miRNAs were differentially expressed in the
female monozygotic twins (Xiao et al., 2019). Fang et al. (2019) also analyzed the
expression profiles of miRNAs in four pairs of monozygotic twins. The results showed
that, out of the 158 miRNAs expressed detectably in each individual, 14% were
differentially expressed among monozygotic twins. Xiao et al. (2019) investigated miRNA
expression in the blood samples of seven pairs of monozygotic twins and found varying
numbers of differentially expressed miRNAs, out of a total of 545 miRNAs, among and
between monozygotic twins.
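
To make the notion of "differentially expressed" concrete, the sketch below flags miRNAs whose expression differs between two co-twins by at least a chosen log2 fold change. It is a minimal illustration, not the method of any cited study: the threshold, the pseudocount, the function name, and the expression values are all assumptions, and real analyses would also use replicates and a statistical test.

```python
import math


def differential_mirnas(twin_a, twin_b, min_abs_log2_fc=1.0, pseudocount=1.0):
    """Return {miRNA: log2 fold change} for miRNAs detected in both twins
    whose expression differs by at least the chosen threshold."""
    flagged = {}
    for name in sorted(twin_a.keys() & twin_b.keys()):
        # The pseudocount avoids division by zero for unexpressed miRNAs.
        ratio = (twin_a[name] + pseudocount) / (twin_b[name] + pseudocount)
        log2_fc = math.log2(ratio)
        if abs(log2_fc) >= min_abs_log2_fc:
            flagged[name] = log2_fc
    return flagged


# Hypothetical normalized read counts for one twin pair (placeholder names).
twin_a = {"miR-A": 99, "miR-B": 50, "miR-C": 7}
twin_b = {"miR-A": 24, "miR-B": 49, "miR-C": 31}

flagged = differential_mirnas(twin_a, twin_b)
fraction_flagged = len(flagged) / len(twin_a.keys() & twin_b.keys())
```

The fraction of flagged miRNAs is the kind of summary reported in the text, for example 14% of 158 detectably expressed miRNAs, which corresponds to about 22 miRNAs.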

From these studies, it can be inferred that miRNAs are potential markers for
differentiating monozygotic twins, but there is a lack of studies on healthy individuals
and of studies addressing intertissue variation of miRNA profiles. Studies have also
suggested that miRNA expression can be affected by race (Ma, Hong, et al., 2015).
Therefore, further studies are needed to establish the feasibility of miRNAs as markers
for monozygotic twin identification.


miRNAs and time since death determination 


Accurate estimation of time since death (TSD) is one of the major challenges
encountered by investigating agencies during the investigation of deaths. TSD estimation
helps the investigators to determine the time of death, which can further help in
determining whether the case is a suicide or homicide, reconstructing the crime scene,
and verifying the alibi of the suspects. The methods currently used for postmortem
interval (PMI) estimation are based mainly on physical changes after death. Even the
recently developed biochemical methods are unable to provide a precise TSD (Shrestha
et al., 2021). Molecular changes in proteins, RNA, and DNA due to postmortem
degradation could provide a more precise and reproducible determination of TSD (Li, Ma,
et al., 2014; Li et al., 2016). Nucleic acids (DNA and RNA), being protected in the
cell nucleus, are considered to be less damaged by external factors and thus may aid in
a more precise determination of TSD. The degradation rate of DNA after death has
been studied by different researchers, but a detailed study on different human tissues
is required for precise PMI estimation (Itani et al., 2011; Johnson & Ferris, 2002;
Perez-Martinez et al., 2017; van den Berge et al., 2016; Williams et al., 2015).
Scrivano et al. (2019) have provided a detailed review of the studies conducted on the
use of RNA in the estimation of the postmortem interval. Most of the studies have been
conducted on animal tissues, and only a few studies on human samples have been reported.

Although a lot of research has been conducted on the use of mRNAs for TSD estimation,
it is worth noting that the postmortem expression of mRNAs is highly affected by
environmental factors such as temperature, humidity, and sunlight. Further, the cause
of death as well as the type of tissue concerned also affect postmortem mRNA
degradation. Ribonuclease activity is higher in the pancreas and liver than in the
brain, heart, and skeletal muscles. Given the postmortem instability of mRNAs, forensic
researchers have shifted their focus toward miRNAs as potential biomarkers for TSD
estimation, largely because of their lower susceptibility to degradation. As miRNAs are
smaller than mRNAs, they are more likely to be found intact after death. Further,
miRNAs are protected by lipid or protein matrices, which add another layer of
protection from endonucleases and ribonucleases (Rocchi et al., 2021). Several
researchers have reported the stability of miRNAs even after exposure to varying
environmental conditions. Layne et al. (2019) reported four miRNAs that were resistant
to UV light exposure in blood, semen, saliva, and urine samples. Similarly, two miRNAs
were found to be stable in blood and semen even after 6 months under different
conditions of temperature, humidity, and sunlight (Layne et al., 2019; Mayes et al.,
2019).

Li et al. (2010) were the first to report the potential use of miRNAs for TSD
estimation. Although several studies have reported no correlation between miRNA
expression and TSD in the early postmortem interval (0–24 h), a number of researchers
have reported a positive correlation of several miRNAs with TSD in the late (24–72 h)
and advanced (>72 h) stages of the postmortem interval (Rocchi et al., 2021). The
marker miR-1-2 was identified in rat cardiac tissue for its potential use in TSD
estimation from 96 to 120 h after death. Further, two miRNAs, namely miR-206 and
miR-133, from mouse liver tissue were suggested to be useful from 24 to 48 h after
death. A study on human bone samples revealed that let-7e and miR-16 can be used as
potential biomarkers for TSD estimation in advanced stages of decomposition. A few
researchers have investigated the expression levels of miRNAs regulating the circadian
clock during the day and night to determine the time of death. Expression levels of
four miRNA markers in human vitreous humor samples were found to be linked with daytime
and nighttime: miR-142-5p and miR-541 had lower expression in the daytime and higher
expression at nighttime, whereas miR-106b and miR-96 had higher expression during the
daytime and lower expression at nighttime. In human blood, miR-142-5p and miR-219 had
higher expression during the daytime and lower expression at nighttime (Rocchi et al.,
2021).

As the expression analysis of RNAs is done using PCR, it is important to set
suitable control biomarkers. As discussed, mRNAs are highly unstable, which makes
it difficult to use them as control markers when studying expression levels in
postmortem tissues. Therefore, researchers explored the possibility of using miRNAs as
reference or control biomarkers in postmortem studies. Various studies were conducted
on different tissues from humans, rats, and mice, and a number of miRNAs were reported
to be ideal reference biomarkers (Rocchi et al., 2021). As miRNA expression varies
among tissue types, the reference miRNAs should be selected according to the tissue
under analysis. A comprehensive list of the miRNAs reported in the literature to be
linked with TSD is provided in Table 18.2.
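
The role of a reference marker in PCR-based expression analysis can be illustrated with the widely used comparative Ct calculation, in which the target's quantification cycle (Ct) is normalized to that of the reference. The sketch below is a minimal illustration that assumes roughly 100% amplification efficiency, so that one cycle corresponds to a twofold change. The function names and Ct values are hypothetical and are not taken from the cited studies.

```python
def delta_ct(ct_target, ct_reference):
    """Ct of the target miRNA normalized to the reference miRNA."""
    return ct_target - ct_reference


def relative_expression(ct_target, ct_reference):
    """Target abundance relative to the reference, 2^(-dCt)."""
    return 2.0 ** (-delta_ct(ct_target, ct_reference))


def fold_change(ct_target_sample, ct_ref_sample, ct_target_control, ct_ref_control):
    """Fold change of the target in a sample versus a control, 2^(-ddCt)."""
    ddct = (delta_ct(ct_target_sample, ct_ref_sample)
            - delta_ct(ct_target_control, ct_ref_control))
    return 2.0 ** (-ddct)


# Hypothetical Ct values: a later postmortem sample versus an early control,
# each measured for a target miRNA and a tissue-appropriate reference miRNA.
rel = relative_expression(ct_target=24.0, ct_reference=22.0)
fc = fold_change(ct_target_sample=24.0, ct_ref_sample=22.0,
                 ct_target_control=26.0, ct_ref_control=22.0)
```

If the reference marker itself degrades after death, the normalized values drift even when the target is unchanged, which is why a stable, tissue-appropriate reference matters.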

It can be concluded that miRNA expression is associated with TSD in late PMI
(24–144 h after death), and that there is little or no correlation between miRNA
expression and TSD in early PMI (0–24 h after death). It should be further noted that
more studies on human tissue samples, with larger sample sizes, are needed.


Wound age estimation using miRNA 


Medico-legal examination of wounds is necessary to determine the severity of the
wounds. At the same time, it is also important to determine the age of the wound, or
the time since the wound was inflicted, as well as whether the wounds were inflicted
antemortem or postmortem. This helps the investigators to verify alibi statements and
reconstruct the events of the crime. When an injury is inflicted, the body immediately
starts the healing process. The healing process has been divided into four overlapping
phases: the hemostasis, inflammatory, proliferative, and remodeling phases. Each phase
has unique morphological and biochemical properties. The detection of phase-specific
characteristics has helped researchers to determine the age of a wound. Various
immunohistochemical markers, cytokines, chemokines, adhesion molecules, enzymes, and
growth factors have been analyzed for determining the age of the wound (Casse et al.,
2016).


Table 18.2 miRNA markers reported in the literature for TSD estimation. 


Application 


TSD estimation 


Determining circadian 


rhythm 


For internal control/ 
Reference marker in 


PCR 


Source of 
tissue 


Human 
Rat 
Mouse 


Human 


Tissue type 


Bones 
Heart 
Liver 


Vitreous 
humor 


Blood 


Lung 
Muscle 
Brain 


Heart, liver, 
brain 

Skin 

Spleen 


Lung 
Muscle 
Brain 


Heart, liver 
Heart 


Liver 
Skeletal 


muscle 


miRNA 
Let-7e and miR-16 


miR-1-2 
miR-133a and miR-206 


miR-142-5p and miR-541 


miR-106b and miR-96 


miR-142-5p and miR-219 


miR-195 and miR-200c 
miR-1 and miR-206 
miR-9, miR-125b 


miR-1 and miR-133a 


miR-203 
miR-125b and miR-143 


miR-195 and miR-200c 
miR-1 and miR-206 
miR-9 and miR-125b 


miR-1 and miR-133a 
miR-122 and miR-133a 


miR-122 
miR-133a


Postmortem stability 
period 


1—6 months 
Up to 120 h 
24—48 h 


Low daytime high 
nighttime 

High daytime low 
nighttime 

High daytime low 
nighttime 

Up to 144h 

Up to 144h 

Up to 22h 


Up to 180 h 


Up to 120 h 

Up to 36 h (25°C) 
Up to 144 h (4°C) 

Up to 144h 

Up to 144h 

Up to 144h 


Up to 24h 
Up to 180 h 
Up to 180h 


Up to 180 h 
Up to 180 h 


References 


Na (2020) 
Li et al. (2010) 
Wang, Mao, et al. (2013)
Odriozola et al. (2013)
Corradini et al. (2015)
Corradini et al. (2015)
Lv et al. (2016)
Lv et al. (2016)
Lu, Ma, et al. (2016)


Lv et al. (2017)


Pan et al. (2014)
Lv et al. (2014)


Lv et al. (2016)

Lv et al. (2016) 

Ma, Pan, et al. 
(2015) 

Lu, Li, et al. (2016) 

Lv et al. (2017) 

Tu et al. (2018, 
2019) 

Tu et al. (2018) 

Tu et al. (2018)


Specific miRNAs have been reported to regulate the genes involved in the wound-healing
process. This makes miRNAs potential biomarkers for estimating the age of
wounds (Rocchi et al., 2021). Different miRNAs have been reported to be characteristic
of particular wound-healing phases. Table 18.3 provides a list of the miRNAs identified
to be involved in the various phases of wound healing.

There is a paucity of literature on differences between antemortem and postmortem
expression profiles of miRNAs, and only a few studies have been published in this area.
miR-21 and miR-205 have been reported to be significantly up-regulated in rat skin
samples at 24 h after death as compared to 0 and 48 h after death. However, no
correlation could be established between TSD and miRNA expression levels (Ibrahim
et al., 2019). Increased levels of two miRNAs involved in the inflammatory phase,
namely miR-125a-5p and miR-125b-5p, have also been reported in the skin ligature marks
of individuals who died by hanging. Further, increased levels of three miRNAs with an
anti-inflammatory role, miR-92a-3p, miR-128-3p, and miR-130a-3p, were observed
(Neri et al., 2019).

It can be concluded that, although a few studies have shown the potential of miRNAs
as biomarkers for wound age estimation, more research is required to establish miRNAs
for practical application in this area.


Table 18.3 miRNAs identified to be involved in various phases of wound healing.

miRNA                          Phase of wound healing    References
miR-21, miR-146a/b, miR-142,   Inflammatory phase        Li et al. (2018), Mori et al. (2018),
and miR-155                                              Roy and Sen (2011), Santurro et al. (2017)
miR-21 and miR-132             Inflammatory and          Li, Wang, et al. (2015), Madhyastha et al.
                               proliferative phase       (2012), Yang et al. (2011)
miR-31 and miR-99              Proliferative phase       Jin et al. (2013), Li, Li, et al. (2015)
miR-29a,b,c and miR-192        Remodeling phase          Cheng et al. (2010), Ciechomska et al. (2014)


Organ tissue identification using miRNA


Internal organ tissues may be retrieved in cases of violent crimes such as stabbing and
shooting. They may be found at the crime scene, on the weapon used, or on the body of
the perpetrator. Identifying the type of organ tissue can help in determining the level
or depth of weapon penetration, the role of the evidence material on which the tissue
was found, and, in certain cases, whether human remains have been dismembered, disposed
of, or transferred to another location. Currently, no standardized molecular methods
exist for the identification of organ tissues. Forensic scientists have to rely on
immunohistochemical methods for tissue identification. Such methods require a good
amount of tissue sample in pristine condition, but the tissues found in criminal cases
are usually degraded and present only in trace amounts. Some researchers have reported
promising methods to identify organ tissues based on mRNA profiling and DNA methylation
levels, but the postmortem instability of these molecules makes it difficult to put
these methods to practical use.

miRNAs have been shown to be potential biomarkers for BFID, and therefore their
potential should be explored for tissue identification as well. In one such study, Sauer
et al. (2017) found that tissues from the brain, heart muscle, lung, liver, skin, and skeletal
muscle could be identified based on the expression levels of 15 miRNAs. They also validated
their findings on samples from mock shootings and stabbings. The study identified individual
miRNAs that were differentially expressed in the different tissues and thus could
be used for tissue identification. The authors noted that a combination of multiplexed miRNAs
should be used for better inferences. This study provides a strong foundation, and further
studies should be conducted with an increased sample size and more tissue types (Sauer
et al., 2017).


miRNA and drowning 


Drowning cases are commonly encountered during forensic investigations. The two major
questions that need to be answered in drowning cases are whether the death really
occurred due to drowning and whether the drowning was postmortem or antemortem.
Currently, the methods employed for such investigations include the analysis of diatoms
in the tissues of the deceased; electrolyte differences in the blood, pleural fluid, and
vitreous humor of the deceased; and the evaluation of aquaporins and some other biomarkers.
Microenvironmental changes such as oxygen saturation and blood electrolyte
concentration during drowning can modify the ion-transport system of various cells.
Certain miRNAs have been reported to regulate the genes that control ion channels
(Sepramaniam et al., 2010). The microenvironmental changes that happen at the time
of death can therefore alter miRNA expression patterns and thus aid in determining
the type of drowning.

Yu et al. (2015) conducted a study to analyze the miRNA expression pattern in the
brain tissue of drowning animal models. The findings of this study revealed that there
were 158 differentially expressed miRNAs, but only eight miRNAs could be established
as significantly upregulated in freshwater drowning and downregulated in saltwater
drowning as compared to the control samples. Further analysis showed that one of these
eight miRNAs, miRNA-706, is involved in the regulation of the HCN1 cation channel.
Thus, miRNA-706 could be a biomarker to differentiate between freshwater and
saltwater drowning (Yu et al., 2015). Although this study has shown the fundamental
behavior of a few miRNAs during drowning, it is the only reported study in this area.
Therefore, more research is required to find candidate miRNA biomarkers applicable to
drowning cases.


miRNA and sepsis 


Sepsis refers to a life-threatening condition that is caused by a poorly regulated inflam- 
matory response of the immune system. It has a high mortality rate (Dellinger et al., 
2013). The pathogenesis of sepsis is still not well understood. The postmortem diagnosis 
of sepsis is difficult as there is no specific histological evidence found (Tsokos, 2007). 
miRNAs have a key role in the regulation of genes responsible for inflammatory re- 
sponses, and therefore researchers have suggested the use of miRNAs as sepsis bio- 
markers. Reithmair et al. (2017) have proposed a set of 180 miRNAs for the precise 
assessment of sepsis-related deaths. They analyzed blood samples of 22 human patients 
with sepsis along with 23 control samples. The miRNA expression profiles were inves- 
tigated in exosomes, serum, and blood cells. The findings revealed that there were 103 
miRNAs with higher levels of expression and 77 miRNAs with lower levels of expres- 
sion in sepsis patients as compared to the control samples. Further, the authors identified 
miR-199b-5p as a potential early biomarker that can be useful in distinguishing patients 
with sepsis (Reithmair et al., 2017). 

It has also been observed that circulating miRNAs could signify the presence of 
particular pathogens. Identification of particular pathogens can help in the investigation 
of sepsis negligence cases. Seven miRNAs have been reported to be correlated with 
infection due to Staphylococcus aureus, and eight miRNAs have been reported to be corre- 
lated with infection due to Escherichia coli (Wu et al., 2013). In addition to these patho- 
gens, approximately 10 other pathogen infections have also been investigated for 
differential miRNA profiles (Rocchi et al., 2021). 

A clearer view of the application of miRNA in sepsis assessment can only be obtained
after thorough research with larger sample sizes has been conducted in this area.


Other potential future applications


miRNAs have attracted considerable interest from biomedical as well as forensic researchers,
but most of the forensic research in this area is still at a preliminary stage. A number of
potential forensic applications of miRNAs have been suggested, and research is ongoing for
most of them. In addition to the aforementioned areas of forensics, miRNA can be a potential
tool for the investigation of fire injuries. Some interesting findings have been reported
that suggest that miRNAs can be used as promising biomarkers to distinguish
antemortem and postmortem burns (Lyu et al., 2018). As the administration of any
drug can alter the expression profiles of miRNAs, miRNAs can be used for antidoping
purposes. Various authors have investigated the correlation of miRNAs with drug and
substance abuse and found promising results (Rocchi et al., 2021). However, extensive studies
with larger cohorts are required to find specific miRNA signatures related to drug abuse.
Further, the age-related changes in miRNA expression profiles can be used to determine
the age of the person whose biological sample is under analysis. Similarly, the time since
the deposition of a biological stain can also be determined based on miRNA profiling
(Glynn, 2020). Meticulous future research is required to investigate these promising
forensic applications of miRNAs.


Potential limitations 


The routine application of miRNA-based techniques in forensic labs is still debatable, as
this is an emerging field that needs considerable standardization. PCR-based techniques
involve the complexity of appropriate primer design, and the cross-reactivity of the
markers limits the routine use of this promising approach. The lack of agreement
in the published literature makes it even more difficult to bring this approach into routine
use. This may be due to the different methodological and analytical approaches
used by researchers across the globe. Further, the commercial unavailability of
bioanalytical kits specifically designed for forensic purposes also poses a difficulty for
researchers (Glynn, 2020). Another important issue in miRNA analysis is the statistical
analysis of the data. It has yet to be established which statistical approach is best
suited to determine whether two datasets are significantly different (Rocchi et al.,
2021).


Concluding remarks 


The forensic techniques that once seemed advanced are becoming antiquated in the face
of newer challenges. miRNA analysis offers a potential solution to many of the
investigative problems encountered by forensic scientists.
Although the major focus of miRNA research across the globe has been on its
biomedical and clinical applications, recent literature has suggested a plethora of
applications in the forensic field. Because miRNA was recognized only recently and the
field is still emerging, research on its forensic applications is limited, but it is
growing continuously. Detailed research is required on each application of miRNA to devise
robust techniques that can produce precise results acceptable to forensic scientists as
well as the court.


NGS approaches for microbial profiling of soil and water


Millions of DNA profiles have been collected and analyzed in support of forensic cases; 
most of these have been for human identification. However, a DNA profile does not 
reveal the circumstances or location associated with the transfer of biological material. 
DNA analysis can be used to identify suspect(s) and link their presence or actions to 
the victim(s) and the scene of the crime. Pet and other nonhuman DNA, such as that 
from insects, pollen, and vegetation, has been instrumental in solving cases. Microbes
have been central to solving forensic cases as they have been used as biological weapons, 
but more commonly they are spread from crime scenes through contact with individuals, 
objects, and environments. Microbes are ubiquitous in soil, water, on the human skin, 
and in the human gut, but samples vary in the species and relative abundance of the 
microbe populations present due to environmental conditions (Mishra et al., 2023). 

The use of microbes in forensic science is still in its infancy. Microbes can include in- 
sects, archaea, bacteria, viruses, and fungi (Allwood et al., 2020; Santiago-Rodriguez & 
Cano, 2016). There are different aspects of microbes that can be of interest in a forensic 
setting, such as using microbes found in soil or water to help pinpoint where a crime was 
committed, if a suspect can be associated with a place, circumstances of drowning cases, 
or to determine the microbial agent in bioterrorism investigations (Budowle et al., 2014; 
Giampaoli et al., 2014; Oliveira et al., 2018; Wang et al., 2020). The effect of environ- 
mental conditions on the DNA profile has the potential to be useful for estimating the 
time of a crime using the necrobiome (Kumari et al., 2022; Oliveira et al., 2018; Young 
et al., 2014). Microbes in soil and water can be profiled to solve crimes. Given the soil 
composition complexity in chemical and biological terms, the inherent degree of genetic 
information could yield a highly site-specific fingerprint in forensic investigations 
(Concheri et al., 2011). 

In particular cases, the soil particulates collected from victims or suspects can be used 
in predictive geolocation to predict the location of the crime (Pirrie et al., 2017). If the 
particulates found on a suspect’s shoe or vehicle match those of the scene, the individual 
can be associated with the scene (Kuiper, 2016; Pirrie et al., 2017). A comparison of soil 
features can be of great significance in a forensic case. Previously, soil profiling was
focused on identifying the different minerals and inorganic compounds in the soil to
analyze the color, texture, particle size, and density (Concheri et al., 2011; Sangwan 
et al., 2020). Morphology, including visual appearance, color, particle size distribution, 
composition, and texture, can be analyzed using optical microscopy, scanning electron 
microscopy, X-ray diffraction (XRD), laser ablation-inductively coupled plasma mass 
spectrometry, transmission electron microscopy, and stable isotope analysis (Pirrie 
et al., 2017). Giampaoli et al. (2014) compared the use of next-generation sequencing 
(NGS) to geologic analysis of soil (color, XRD crystalline phases, and mineral abundance) 
and analyzed bacteria, eukaryotes, and plant material present in the soil to differentiate six 
soil samples by principal component analysis. In one case, soil evidence and geolocation
were used to draw a confession from the suspect (Guo et al., 2021). Soil was found on the 
suspect’s clothing and in the area where the victim’s body was discovered (Guo et al., 
2021). Analysis methods were able to identify different minerals, diatoms, elements, 
and pollen spores in comparing the dirt from the victim to the suspect (Guo et al., 2021). 

Soil and water differentiation using color and texture can be subjective. DNA can be 
used to analytically differentiate soil and water samples. Previous soil DNA research uti- 
lized denaturing gradient gel electrophoresis, which allowed discrimination using 10 
distinct bands (Muyzer et al., 1993). An alternate method suggested for differentiating 
DNA sequences from samples (including autosomal and mitochondrial DNA) is through 
the use of NGS, a non-Sanger-based DNA sequencing technology (Yang et al., 2014). 

As NGS has been on the market for almost 20 years, its cost has dropped significantly,
enabling more forensic labs to adopt NGS technology for use in forensic cases for a va- 
riety of applications (Elkins et al., 2021). Additionally, the high throughput is unparal- 
leled: the sequencing reactions can be performed on multiple samples and target 
regions simultaneously. Finally, no capillary or gel electrophoresis is needed as the se- 
quences are read directly during sequencing by synthesis steps. NGS has been applied 
to microbial forensics cases, including detecting and differentiating Bacillus anthracis 
(anthrax) and Yersinia pestis (plague) in samples and evaluating flora and fauna in environ- 
mental samples (Kuiper, 2016; Yang et al., 2014). 

Early research studied the bacterial diversity of a soil sample (Stackebrandt et al., 1993)
but did not compare several locations, including crime scenes. NGS can be used to analyze
a wider range of microbiological diversity (Giampaoli et al., 2018). In a more recent study
using the 18S rRNA gene, soil from 11 environments (including forest, grassland, fields,
and an urban park) was compared using NGS on the Roche 454 platform (Lilje et al.,
2013). Whole genome amplification (WGA), DNA sequencing, and metagenomic classification
were used to differentiate samples from two residential parks that were similar in
color (Khodakova et al., 2014). The indoor microbiome was investigated in houses in Germany
both spatially and temporally (Weikl et al., 2016). The fungal composition was influenced
more by the outdoor vegetation and particulates than the bacterial composition, but
both changed over time (Weikl et al., 2016). A more recent study reported that the soil
location is more important than the soil type of the sample (Habtom et al., 2019). The
microbiome changes from location to location even if the soil type remains the same
(Habtom et al., 2019). With microbial DNA profiling being a new approach in forensic
investigations, validation of kits and methods is necessary for their implementation in a
forensic laboratory (Kuiper, 2016). There is still much work that needs to be done before
microbial forensics is widely used for investigative purposes. In this chapter, we will
review the current kits and technology available for use.

The first step in the analysis of microbial DNA is sampling, followed by DNA extraction.
The samples must be systematically extracted and stored in a cold place so that the
microbe representation in the sample does not change. The extracted DNA is quantified
to determine if sufficient DNA has been isolated, and a quick polymerase chain reaction
(PCR) assay is used to determine if PCR inhibitors such as humic acid have been
eliminated in the extraction process and if the samples amplify prior to NGS. Conserved
regions, such as the 16S rRNA gene, are used for this application. Finally, the samples are
sequenced to determine the identity of the microbe(s). The 16S rRNA gene can be targeted
as it has nine hypervariable (HV) regions that can be used to identify bacteria by
comparison to databases, and the constituent species or phyla can be analyzed by
metagenomics. Alternatively, a sample can be sequenced and identified using whole genome
sequencing (WGS), which is limited to known species (Laurence et al., 2014).


Extraction methods for obtaining microbial DNA from soil and water 


When analyzing any sample whether it be from a human subject, an article of clothing, 
soil, or even samples in a bioterrorism case, the extraction of the DNA from the sample is 
the critical first step. Steps, including UV irradiation, must be taken to ensure that there is 
no contamination (Laurence et al., 2014). Every accredited forensic laboratory has stan- 
dard operating procedures (SOPs) in place to avoid contamination, and the validation of 
methods ensures that any instruments and methods being used in the DNA recovery pro- 
cess are reliable and robust. 

In soil samples, the first step of the extraction process is the most important. To properly
extract the DNA from the soil, the analyst must lyse all of the cells to ensure that
DNA is not washed away with the soil debris in subsequent steps. There are two approaches
to soil DNA extraction. In indirect extraction, the cells are separated from
the soil prior to lysis, whereas in direct extraction the cells are lysed in the presence
of soil debris. When possible, direct extraction is the preferred method as it involves
fewer steps, which means fewer chances for contamination or sample loss.

One of the considerations when deciding upon an extraction method or kit is
that not all available kits have been tested for all soil types. The kit can greatly affect DNA
extraction yields and efficiency, and some kits may overrepresent or underrepresent
certain microbe species within the sample. Ensuring the kit used is optimal for the soil
type is crucial. Available soil extraction kits are listed in Table 19.1. The input
quantity ranges from 0.05 to 10 g. Most of the kits employ bead-beating steps, while
some involve heating and enzymatic digestion steps. Some of the methods can be automated,
but most are manual extraction methods. The PowerSoil DNA Isolation Kit is
recommended for extracting high-quality DNA from soil samples and removing PCR
inhibitors; soil can be a difficult substrate from which to recover DNA (Young et al.,
2014).

Table 19.1 Commercially available kits for soil DNA extraction.
Kit name | Company | Soil mass | Cell lysis/physical treatment method | Elution volume | Comment(s)
DNeasy PowerSoil Pro Kit | QIAGEN | <0.25 g | Bead-beating (different bead sizes and types available) | 50-100 µL | Effective at removing PCR inhibitors from even the most difficult soil types. There is an option to do a 96-well extraction.
DNeasy PowerMax Kit | QIAGEN | <10 g | Bead-beating (different bead sizes and types available) | 5 mL | Utilizes their patented Inhibitor Removal Technology (IRT), making it possible to process samples that have in the past proven difficult due to high levels of humic-like substances.
DNeasy UltraClean Microbial Kit | QIAGEN | <0.5 g | Bead-beating (ceramic or glass beads are available), heat, and detergent | 50 µL | Can be automated with the option to do 4-96 well plates at a time.
SoilMaster DNA Extraction Kit | Epicentre | 0.10 g | Hot-detergent lysis, proteinase K | 300 µL | Utilizes a chromatography step to remove enzymatic inhibitors known to coextract with the DNA.
Soil DNA Isolation Kit | Norgen Biotek | 0.25 g | Bead-beating | 50-100 µL | Removes all traces of humic acid and PCR inhibitors with an OSR (organic substance removal) solution.
NucleoSpin Soil Kit | Macherey-Nagel | <0.5 g | Offers two alternative lysis buffers to suit the soil sample, bead-beating (ceramic beads) | 30-100 µL | For high molecular weight genomic DNA from microorganisms, the NucleoSpin inhibitor removal column helps to remove inhibitors and humic acid from the sample.
SurePrep Soil DNA Isolation Kit | Fisher BioReagents | 0.25 g | Bead-beating | 50 µL | Removes all humic acid.
E.Z.N.A. Soil DNA Kit | OMEGA Biotek | 0.1-0.25 g | Incubation, bead-beating (glass beads) | 50-100 µL | Unique inhibitor removal reagent (CHTR reagent) for effective removal of humic acid and other PCR inhibitors from eluted DNA.
FavorPrep Soil DNA Isolation Kit (Mini/Midi) | FAVORGEN Biotech Corp | Mini: 0.25-0.5 g; Midi: <10 g | Incubation, bead-beating (glass beads), proteinase K | Mini: 50-200 µL; Midi: 1-2 mL | Soil in lysis buffer is incubated at 70°C for 10 min.
FastDNA SPIN Kit | MP Biomedicals | Not given | Cell lysis buffer and the use of the FastPrep instrument to disrupt cell walls | 50-100 µL | Delivers the highest DNA yield based on citations.
prepGEM Bacteria | MicroGEM | Not given | Temperature-driven enzymatic lysis | 100 µL | Green + buffer is needed.
ZymoBIOMICS DNA Miniprep Kit | ZYMO Research | Not given | Bead-beating | 100 µL | Uses its BashingBeads to break apart tough microbes.
PureLink Microbiome DNA Purification Kit | Invitrogen | Not given | Heat, chemical, and mechanical disruption (bead-beating) | 50-200 µL | Typical DNA recovery is 0.5-15 µg from 0.2 g of soil sample.


Quantitation methods for microbial DNA 


Human identification methods typically employ a quantitation step to ensure that an 
appropriate quantity of DNA is input for targeted PCR prior to capillary electrophoresis 
separation or sequencing. Microbe kits can be used for quantitative PCR (qPCR) quan- 
titation. An example is the Qiagen Microbial DNA qPCR Assay Kit. Alternatively, 
nonspecific DNA quantification methods, including UV-Vis spectroscopy and fluores- 
cence spectroscopy can be used. qPCR methods can be sensitive to picomolar concen- 
trations, while fluorescence is sensitive to nanomolar concentrations, and UV-Vis 
spectroscopy is sensitive to micromolar concentrations. The NanoDrop and Qubit in- 
struments are rapid and easy-to-use UV-Vis and fluorescence spectrometers, respectively. 
The Qubit requires a few preparation steps with consumables that the NanoDrop does 
not, but it has higher sensitivity. 
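
Because instruments such as the NanoDrop and Qubit typically report a mass concentration
(ng/µL), while the sensitivities above are stated in molar terms, a unit conversion can help
when comparing methods. The following Python sketch is illustrative only: the function name
is hypothetical, and it assumes double-stranded DNA with an average mass of about 660 g/mol
per base pair, a common approximation that is not taken from this chapter.

```python
# Average molar mass of one double-stranded DNA base pair (g/mol).
# This is an assumed, commonly used approximation, not a value from the chapter.
AVG_BP_MASS_G_PER_MOL = 660.0


def dsdna_ng_per_ul_to_nanomolar(conc_ng_per_ul: float, length_bp: int) -> float:
    """Convert a dsDNA mass concentration (ng/uL) to a molar concentration (nM)."""
    if length_bp <= 0:
        raise ValueError("length_bp must be positive")
    # 1 ng/uL equals 1e-3 g/L; dividing by the molar mass gives mol/L,
    # and multiplying by 1e9 expresses the result in nM.
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS_G_PER_MOL * length_bp)


if __name__ == "__main__":
    # Hypothetical reading: 10 ng/uL of a 500 bp amplicon.
    print(round(dsdna_ng_per_ul_to_nanomolar(10.0, 500), 2))
```

The conversion depends on fragment length, so for a mixed soil extract the result is only as
good as the assumed average fragment size.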


NGS methods for microbial DNA analysis 


NGS technology, including the MiSeq instrument by Illumina, has been developed to
analyze multiple loci and samples in a single run at a relatively low per-sample cost
with an extremely high depth of coverage for microbes and the ability to detect
low-abundance species. Although DNA methods have been applied to soil discrimination
in the past, the use of NGS to probe 16S and 18S rRNA gene HV region amplicons
for soil microbe profiling can lead to more detail. NGS can be used to analyze the
microbiome for forensics (Zhang et al., 2022). NGS can discriminate between samples that
have alleles of similar lengths and regions with single nucleotide polymorphisms and
can be used with metabarcoding to differentiate samples that are geologically very similar
but come from different environments (Fadrosh et al., 2014; Young et al., 2016). Soil
biodiversity can be analyzed using environmental DNA (eDNA) using NGS metabarcoding
(Froslev et al., 2022). There are optimized NGS analysis workflows for 16S metagenomics,
typically using the V3 and V4 HV regions but also using the full 16S region
(Fadrosh et al., 2014; Lu & Salzberg, 2020). The 16S method enables taxonomic
classification of the microbial communities in the samples.

Illumina instruments employ sequence-by-synthesis (SBS) chemistry; single bases are
detected as they are incorporated into DNA strands in a massively parallel process 
(Kircher & Kelso, 2010). Fluorescent terminator dyes are imaged as each nucleotide 
(dNTP) is added and then cleaved to allow incorporation of the next base in a reversible 
terminating process. Since all four reversible, terminator-bound dNTPs are present, 
incorporation bias is minimized. Plug-and-play reagent cartridges with radio-frequency 
identification (RFID) tracking are used to reduce the complexity of use and maintenance 
(Kircher & Kelso, 2010). After sample preparation (extraction, quantitation, PCR for 
adding adaptors and indices, and NGS library purification), the length of the sequencing 
process is approximately 7.5-8 h (4 h for clonal amplification, sequencing, and
quality-scored base calling, and 3 h for sequence alignment).
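
The cycle-by-cycle logic described above can be illustrated with a toy Python sketch. It is
a deliberately simplified model and not Illumina's base-calling software: the function name
is hypothetical, the template is assumed to be given in the order in which the polymerase
traverses it, and sequencing errors, phasing, and quality scoring are ignored.

```python
# Watson-Crick pairing used to pick the nucleotide incorporated opposite each template base.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sequence_by_synthesis(template: str, cycles: int) -> str:
    """Return the bases read after a given number of one-base SBS cycles (toy model)."""
    read = []
    for template_base in template[:cycles]:
        # All four terminator-bound dNTPs are offered; only the complementary one pairs.
        incorporated = COMPLEMENT[template_base]
        # The imaging step records the dye of the incorporated base; the terminator
        # would then be cleaved so that the next cycle can add one more base.
        read.append(incorporated)
    return "".join(read)


if __name__ == "__main__":
    # Hypothetical four-base template read for three cycles.
    print(sequence_by_synthesis("ATGC", 3))
```

The point of the model is that the read length is set by the number of cycles run, since
the reversible terminator allows exactly one incorporation per cycle.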

The first step in Illumina sequencing is library preparation, which results
in amplified, tagged, and barcoded PCR products. The first PCR step in library preparation
amplifies and tags the target region. The second round of PCR adds a unique
barcode to each amplicon and adapters that bind the Illumina sequencing flow cell
(Fadrosh et al., 2014). A PhiX control can be used to verify that library prep has been
completed correctly. PCR products can be purified using the Agencourt AMPure XP
magnetic PCR purification system, quantified using a Qubit fluorescence quantification
system, and sized with a QIAxcel DNA fragment sizing system or agarose gel. The purified
PCR products are normalized before pooling for simultaneous sequencing of
numerous targets and tens of samples in parallel on an Illumina instrument.
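
Because each library carries a unique barcode, reads from a pooled run can be assigned back
to their samples after sequencing. A minimal Python sketch of this idea follows. The barcode
sequences, sample names, and function name are hypothetical, and only exact barcode matches
are handled, whereas production demultiplexing tools typically tolerate mismatches.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample):
    """Group (barcode, sequence) pairs by sample name.

    Reads whose barcode is not in the sample sheet are collected under "undetermined".
    """
    bins = defaultdict(list)
    for barcode, sequence in reads:
        sample = barcode_to_sample.get(barcode, "undetermined")
        bins[sample].append(sequence)
    return dict(bins)


if __name__ == "__main__":
    # Hypothetical sample sheet and pooled reads.
    sample_sheet = {"ACGT": "soil_A", "TTAA": "soil_B"}
    pooled = [("ACGT", "SEQ1"), ("TTAA", "SEQ2"), ("ACGT", "SEQ3"), ("GGGG", "SEQ4")]
    for sample, sequences in demultiplex(pooled, sample_sheet).items():
        print(sample, len(sequences))
```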

NGS data analysis reports the taxonomy of the soil bacterial species identified, the number
of read base pairs, the total number of reads, the median read length, and the mean read length.
Several bioinformatic software tools are used. Phylogenetic and microbial relation- 
ships can be determined using metagenomics (Kirubakaran et al., 2020). A barcoding 
approach for DNA from plants, microflora, metazoa, and protozoa was developed and 
applied to forensic soil samples (Giampaoli et al., 2014). A semiautomated approach for 
NGS metabarcoding for fungal analysis was recently published (Giampaoli et al., 2020). 
The newly created forensic microbiome database tool collects 16S rRNA data from 
publicly available data with its associated metadata and can be used to interpret micro- 
biome data for forensic geolocation (Singh et al., 2021). 
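
The read-level metrics named at the start of this section (total reads, read base pairs,
and median and mean read length) are simple to compute once the sequences are available.
The Python sketch below is an illustration under stated assumptions: the function names are
hypothetical, the input is assumed to be unwrapped FASTQ text with exactly four lines per
record, and the toy reads are invented.

```python
from statistics import mean, median


def fastq_sequences(lines):
    """Yield the sequence line of each four-line FASTQ record."""
    for index, line in enumerate(lines):
        if index % 4 == 1:  # second line of every record holds the bases
            yield line.strip()


def read_length_summary(sequences):
    """Return total reads, total bases, and median and mean read length."""
    lengths = [len(sequence) for sequence in sequences]
    if not lengths:
        raise ValueError("no reads supplied")
    return {
        "total_reads": len(lengths),
        "total_bases": sum(lengths),
        "median_read_length": median(lengths),
        "mean_read_length": mean(lengths),
    }


if __name__ == "__main__":
    # Three invented reads in FASTQ layout.
    toy_fastq = [
        "@read1", "ACGT", "+", "IIII",
        "@read2", "ACGTAC", "+", "IIIIII",
        "@read3", "AC", "+", "II",
    ]
    print(read_length_summary(fastq_sequences(toy_fastq)))
```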


Application of microbial DNA analysis in geolocation in the United 
States 


Soil studies for forensic geolocation have been conducted in Germany and Australia. 
We have collected and are analyzing the soil microbiome signatures of soil collected 
from US states and urban and rural environments. The location of each site has been 
recorded, and the soil DNA has been extracted using several kits, quantitated using a 
NanoDrop spectrophotometer, and amplified using PCR with primers targeting the
16S rRNA gene (Klindworth et al., 2013). The samples are undergoing 16S library
prep to selectively amplify the V3/V4 region of bacteria and archaea for analysis of vari- 
ation among the locations and to identify distinguishing species in each location. 
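
One common way to quantify variation between locations is a dissimilarity index computed on
relative abundances. The Python sketch below uses the Bray-Curtis dissimilarity, which is
chosen here purely for illustration and is not necessarily the measure used in the study
described; the function names, taxon names, and counts are invented.

```python
def relative_abundance(counts):
    """Convert raw per-taxon read counts into fractions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {taxon: count / total for taxon, count in counts.items()}


def bray_curtis(profile_a, profile_b):
    """Bray-Curtis dissimilarity: 0 for identical profiles, 1 for no shared taxa."""
    taxa = set(profile_a) | set(profile_b)
    difference = sum(abs(profile_a.get(t, 0.0) - profile_b.get(t, 0.0)) for t in taxa)
    total = sum(profile_a.get(t, 0.0) + profile_b.get(t, 0.0) for t in taxa)
    if total == 0:
        raise ValueError("both profiles are empty")
    return difference / total


if __name__ == "__main__":
    # Invented phylum-level counts for two hypothetical sites.
    site_a = relative_abundance({"Acidobacteria": 50, "Proteobacteria": 50})
    site_b = relative_abundance({"Acidobacteria": 25, "Proteobacteria": 75})
    print(bray_curtis(site_a, site_b))
```

Working on relative abundances rather than raw counts keeps samples with different
sequencing depths comparable, at the cost of discarding information about total biomass.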


Ongoing challenges and conclusion 


Microbial DNA profiling using NGS is being applied to a variety of applications,
including associating a suspect with a crime scene, geolocation, and identification of
bioterror agents (Guo et al., 2021; Haarkötter et al., 2021; Kuiper, 2016). NGS has made
microbial profiling less expensive and more accessible. Limitations of microbial DNA
profiling from soil and water include the completeness of searchable databases for com- 
parison and species identification, the availability of standardized library prep kits, stan- 
dardized protocols, analysis toolkits for forensic applications, navigating coextracted 
contaminants, sequencing prior to precloning, and the capability of the sequencer to 
perform deep sequencing. Lab SOPs will need to be updated to include microbial 
profiling and standardized workflows; analysis pipelines, data storage servers, database
applications, and reporting will also need to be addressed (Budowle et al., 2014; Jurkevitch
et al., 2020; Sjödin et al., 2012).


Introduction


DNA typing from biological material recovered from a crime scene and human remains 
for human identification has proven to be a difficult task for forensic geneticists due to 
various factors that influence the success or failure of testing. The biological evidence 
that forensic genetics experts get for analysis is not always the most suitable type of sample 
for forensic DNA analysis, especially in terms of quantities (Whitaker et al., 2001; Chung 
et al., 2004; Dixon et al., 2006; Gill et al., 2007). The investigation of ancient DNA also 
presented many challenges in obtaining enough template DNA quantity to test (Llamas 
et al., 2017). 

Conventional techniques requiring the typing of short tandem repeat (STR) markers 
using capillary electrophoresis (CE) frequently fail in these cases. A possible explanation 
for this might be the length of the amplified fragment or the signal imbalance observed
between the short and long tandem repeats (Chung et al., 2004; Dixon et al., 2006). 

DNA typing through STR analysis has become one of the most well-known tools 
and the gold standard for human identification (Jordan & Mills, 2021). A major advantage 
of this approach is the standardization of loci utilized by all laboratories and the extremely 
large searchable database of genetic profiles. However, some limitations and challenges 
are faced during the DNA typing of highly degraded or low template DNA samples. 
Alternative approaches have been developed to overcome these limitations and
to improve the chances of acquiring a genetic profile from such challenging samples,
notably the analysis of mitochondrial DNA (mtDNA), single nucleotide polymorphisms
(SNPs) (Brenner & Weir, 2003), insertion/deletion (indel) polymorphisms
(Pereira et al., 2009), and mini-STRs (Holland et al., 2003).

CE-based methods were and are the standard method used in forensic analysis and 
were selected for their reliability and validity. However, the situation necessitates the 
search for advanced ways to aid the analysis of the challenging DNA evidence. 

The introduction of novel techniques and technologies into the criminal justice sys- 
tem is slow, but new methodologies are being developed that enable the analysis of these 
challenging samples and that can generate intelligence about the donor of a biological 
sample (McCord et al., 2019; Butler & Willis, 2020). Massively parallel sequencing
(MPS), or next-generation sequencing (NGS), could overcome the limitations of the
previous technologies used to analyze DNA evidence and has brought numerous solutions
for assessing degraded DNA evidence, low-template DNA, DNA mixtures, and ancient
DNA.

This chapter highlights the types of challenging DNA samples, the previous technol- 
ogy used to analyze them, the important application of NGS, and the available gene 
panels that have been utilized for the abovementioned evidence. 


Application of next-generation sequencing technology in forensic 
science 


The use of NGS for forensic applications has expanded rapidly in the last few years, and its 
incorporation into forensic workflows is becoming a realistic choice (Kayser & Parson, 
2018; de Knijff, 2019). NGS has been chosen mainly due to the continuous decrease 
in costs, the implementation of initial PCR-amplification of a set of target markers fol- 
lowed by MPS of the resulting amplicons, and the development of bioinformatics tools 
that enable the analysis of the large volume of complex data produced (Alonso et al., 
2018; Bruins et al., 2018). 

One of the major advantages of NGS methods is the ability to incorporate a large 
number of different marker types into a single assay, as well as shorter amplicons 
compared to conventional STR profiling methods, improving the results, power of 
discrimination, and robustness to analyze damaged and limited DNA input samples 
(Shih et al., 2018; Zhang et al., 2018; Borsuk et al., 2018; Amankwaa & McCartney, 
2021). Another significant benefit is the capability to detect nucleotide sequence varia- 
tion in the targeted markers, such as variants in STR repeat regions and flanking 
sequences, which is useful in complex mixture profile interpretation and therefore en- 
ables the discrimination of alleles that would be indistinguishable using CE-STR-based 
typing (Pereira et al., 2018; Novroski et al., 2019). These developments explain how 
NGS technology continues to impact forensic science positively. 
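
The difference between length-based and sequence-based allele calling can be illustrated with a minimal Python sketch. The reads, repeat motifs, and function names below are hypothetical and chosen only for illustration; they are not data from the cited studies or the output of any real analysis pipeline.

```python
from collections import Counter

def call_by_length(reads, motif_len=4):
    """CE-style view: alleles are distinguished only by repeat-region length."""
    return Counter(len(read) // motif_len for read in reads)

def call_by_sequence(reads):
    """MPS-style view: alleles are distinguished by their full sequence."""
    return Counter(reads)

# Two hypothetical 10-repeat alleles of equal length but different internal sequence
allele_a = "TCTA" * 10
allele_b = "TCTA" * 4 + "TCTG" + "TCTA" * 5
reads = [allele_a] * 120 + [allele_b] * 110

by_length = call_by_length(reads)
by_sequence = call_by_sequence(reads)
print("length classes:", len(by_length))
print("sequence alleles:", len(by_sequence))
```

In this toy case the length-based view collapses both alleles into a single 10-repeat class, whereas the sequence-based view keeps them apart, which is the property that can help in mixture interpretation.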

Despite this success, NGS technologies are not yet widely used in casework for 
other areas of forensic analysis that are indispensable to the forensic community, 
such as RNA sequencing, epigenetics, and forensic DNA phenotyping 
(Yang et al., 2014). 


Panels developed for the forensic samples using the NGS 
technologies 


Several NGS-based sequencing platforms, also called second-generation sequencing 
platforms, have been developed by Roche Life Sciences, Thermo Fisher Scientific, 
Illumina, and Applied Biosystems, including pyrosequencing, Ion Torrent technology, 
the Illumina/Solexa platform, and SOLiD (Diegoli et al., 2012) (Table 20.1). 


Table 20.1 Panels developed for the forensic samples using the NGS technologies. 

Platform: Ion GeneStudio S5 System, Ion PGM System, Ion S5 XL System, Ion OneTouch 2 System, Ion Chef System 

  Applications: 
    Analyze highly degraded or trace DNA 
    Generate more investigative leads in development 
    Analyze familial and kinship cases 
    Analyze missing persons and remains 
    Analyze DNA mixtures with more efficiency 

  Kit: Precision ID mtDNA Whole Genome Panel 
  Markers: Analyzes the entire human mitochondrial genome (16,569 bp). 

  Kit: Precision ID mtDNA Control Region Panel 
  Markers: A 2-pool multiplex assay that targets the 1.2 kb control region of the human mitochondrial genome, which encompasses HV-I, II, and III. 

  Kit: Precision ID GlobalFiler NGS STR Panels 
  Markers: The panel targets 35 markers comprised of 21 CODIS STR markers, 9 additional multiallelic STR markers, and 4 sex determination markers. 

  Kit: Precision ID Ancestry Panel 
  Markers: The panel includes 165 autosomal markers (SNPs). 

  Kit: Precision ID Identity Panel 
  Markers: The panel includes 34 upper Y-clade SNPs and 90 autosomal SNPs. 

  Kit: Ion AmpliSeq VISAGE-Basic Tool Research Panel 
  Markers: Panel consists of 153 SNPs, allowing prediction of EVCs and biogeographical ancestry and incorporating 41 SNPs of the HIrisPlex-S assay and 115 BGA markers chosen for differentiation of seven continental populations. 

  Kit: Ion AmpliSeq DNA Phenotyping Panel 
  Markers: This panel targets 23 SNPs and one indel from the following genes: MC1R, SLC45A2, ASIP/PIGU, EXOC2, HERC2, IRF4, KITLG, OCA2, TYR, SLC24A4, and TYRP1. 

  Kit: Ion AmpliSeq PhenoTrivium Panel 
  Markers: Contains ancestry, DNA phenotyping, and male lineage markers for a total of 200 autosomal SNPs and 120 Y chromosomal SNPs. 

  Kit: Ion AmpliSeq HID Y-SNP Research Panel v1 
  Markers: The panel enables analyses of 859 Y-SNPs to infer 640 Y haplogroups. 

  Kit: Ion AmpliSeq MH-74 Plex Research Panel 
  Markers: The panel is a 157-325 bp assay covering 74 microhaplotypes (230 SNPs) selected from a set of 130 microhaplotypes. 

Platform: Illumina (MiSeq) 

  Applications: 
    Forensic DNA testing for criminal casework 
    mtDNA analysis to identify missing persons 
    Disaster victim identification 

  Kit: ForenSeq DNA Signature Prep Kit 
  Markers: 230 genetic markers: 27 global autosomal STRs, 24 Y-STRs, 7 X-STRs, 95 identity SNPs, 22 phenotypic SNPs, and 56 biogeographical ancestry SNPs. 

  Kit: Verogen ForenSeq mtDNA Control Region Kit 
  Markers: Analyzes the mtDNA D-loop hypervariable region. 

  Kit: Verogen ForenSeq mtDNA Whole Genome Kit 
  Markers: For difficult samples; amplifies four overlapping fragments spanning the hypervariable regions. 

  Kit: Combining the ForenSeq DNA Signature and the Nextera XT DNA Library Prep Kits 
  Markers: For clean, intact samples; amplifies two long fragments spanning the entire human mitochondrial genome (16,569 bp). 

  Kit: Verogen ForenSeq MainstAY Kit 
  Application: For targeted sequencing of established autosomal and Y-STR markers. 
  Markers: Simultaneous analysis of 52 genetic markers. ForenSeq MainstAY Kit: 27 global autosomal STRs, including the CODIS and European standard sets, and 25 Y-STRs, including the minimum set for the Y Haplotype Reference Database (YHRD). ForenSeq MainstAY SE Kit: includes all markers in the ForenSeq MainstAY Kit + SE33. 

  Kit: Verogen ForenSeq Kintelligence Kit 
  Application: For long range kinship and genetic genealogy use. 
  Markers: Simultaneous analysis of 10,230 SNPs. 

  Kit: QIAseq Investigator Panels (https://www.qiagen.com/us/products/human-id-and-forensics/nextgeneration-sequencing/qiaseq-investigator-panels/) 
  Application: Dedicated QIAseq panels for human identity investigations. 

Qiagen. (2022). “Next-Generation Sequencing for HID.” Retrieved January 10, 2023, from https://www.qiagen.com/us/product-categories/human-id-and-forensics/nextgeneration-sequencing. Thermo Fisher Scientific. (2022). “Precision ID NGS System for Human Identification: Targeted Sequencing for Forensic DNA Analysis.” from https://www.thermofisher.com/sa/en/home/industrial/forensics/human-identification/forensic-dna-analysis/dna-analysis/next-generation-sequencing-ngs-forensics.html. 


Furthermore, continuous optimization has resulted in innovative third- and 
fourth-generation platforms such as PacBio’s single-molecule real-time sequencing, 
nanopore sequencing, etc. Since 2012, these available instruments have significantly 
increased the number of sequenced genomes and published genome-based studies 
(Barba, n.d., pp. 149-169; Braslavsky et al., 2003; Harris et al., 2008). 

Similarly, a wide range of validated and/or evaluated kits targeting various forensically 
relevant markers is currently available for use on each platform (Wang et al., 2017; 
Alonso et al., 2018; Ballard et al., 2020). Notable examples include the Precision ID 
GlobalFiler NGS STR Panel and the Precision ID Identity and Ancestry Panels, which use 
the Ion S5 system to sequence STR and SNP markers (Meiklejohn & Robertson, 2017; 
Montano et al., 2018); Promega’s PowerSeq system, which sequences autosomal and 
Y-STRs on the Illumina MiSeq platform (Guo et al., 2017); and the Verogen ForenSeq 
DNA Signature Prep Kit, designed to work with the MiSeq FGx Forensic Genomics System, 
which uses Illumina technology to sequence autosomal STRs, Y-/X-STRs, and identity SNPs 
and can be expanded to include phenotype- and ancestry-informative SNPs (Moreno 
et al., 2018; Pereira et al., 2018). 

Several kits have also been developed to target part or all of the mitochondrial 
genome. These include the Precision ID mtDNA Whole Genome Panel, which amplifies the 
entire mtDNA genome using Ion AmpliSeq technology; Nextera XT library preparation 
followed by Illumina sequencing on the MiSeq FGx system; and the Verogen ForenSeq 
mtDNA Control Region Kit, which also uses the MiSeq FGx Sequencing System (Peck 
et al., 2018; Pereira et al., 2018; Holt et al., 2019). 


DNA damage and degradation 


Human biological specimens submitted for forensic DNA analysis are rarely unaffected 
by exposure to their environment; consequently, only small amounts of DNA, generally 
highly damaged, are frequently recovered from this kind of sample. Several 
factors such as heat, humidity, pH of the aqueous environment, ultraviolet (UV) light, 
chemical agents, and enzymatic attack can contribute to the damage sustained by samples 
collected from crime scenes, mass disasters, and mass graves, and it 
is difficult to determine the extent of damage when the DNA of interest is present in 
a sample containing DNA from multiple sources (Hall & Ballantyne, 2004). 
Commonly, the DNA is damaged through various processes such as 
oxidation, hydrolysis, pyrimidine dimer formation, cytosine deamination, DNA-DNA 
cross-linking, and DNA-protein cross-linking (Diegoli et al., 2012). The most common forms 
in forensic samples are oxidation, hydrolysis, and pyrimidine dimers. Oxidative damage 
is mediated by free radicals and creates modified base products that block DNA 
replication during PCR, because standard Taq DNA polymerase lacks 3’ exonuclease 
activity and cannot bypass the lesions (Evans et al., 1993; Lindahl, 1993). 

It has been reported that hydrolysis and oxidation still take place even in solid-state 
DNA and can only be avoided if the biological evidence is protected from atmospheric 
water and oxygen. This is difficult in practice, especially since such exposure 
is commonly encountered in cases of natural and mass disasters (flooding, tsunamis, 
waterlogged burial sites, etc.) (Colotte et al., 2011). 

Additionally, photochemical reactions induced by UV radiation form covalent 
linkages through localized reactions at the C=C double bonds of adjacent pyrimidines 
(cytosine or thymine), which stall DNA polymerase and arrest replication during PCR 
(Goodsell, 2001). UVC radiation-induced DNA damage can make forensic DNA analysis 
partially or completely unsuccessful by reducing PCR efficiency and, subsequently, 
the quality of the DNA profile (Grsković et al., 2013). 

The passage of time and improper storage and processing of biological materials 
also increase the risk of DNA degradation. This can be due to the persistent growth of 
microbial fauna and flora, which are also common contaminants in skeletal 
material recovered from a crime scene or a mass disaster (Van der Schans, 1978). 
Microorganism growth has been shown to be a significant cause of DNA damage that can 
render forensic samples untypeable: secreted digestive enzymes induce DNA damage 
via reactive oxygen species, and double-stranded breaks can increase by 25-fold 
(Ballantyne, 2006). Microbial degradation can also lead to poor or no PCR 
amplification as well as heterozygous peak imbalance (Dash & Das, 
2018). 

While the causes of DNA degradation differ, the final result remains the same: 
fragmentation of the double helix into small pieces, leading to a significant loss of 
genetic information, templates that are inadequate for PCR amplification, and incomplete 
or no DNA profiles (Bender et al., 2004; Ballantyne et al., 2007; Fondevila et al., 2008). 

Several studies have revealed that PCR inhibition is the most common 
cause of PCR failure and remains a major challenge in forensic DNA analysis of biological 
remains recovered from the environment (Alaeddini, 2012). These inhibitors, which are 
found in a variety of biological materials (hemoglobin, myoglobin, bile salts), 
environmental samples (fulvic acids, humic acids, humic material, metal ions), and food 
(calcium ions, proteases) (Schrader et al., 2012), may negatively affect cell lysis during 
extraction, interfere with primers, or reduce polymerase activity (Hedman & Rådström, 2013). 

Overall, regardless of the source, DNA damage may significantly interfere with 
PCR amplification and therefore decrease genotyping success, especially when the 
DNA is isolated from a piece of biological evidence discovered at a crime scene that is 
not under homeostatic control, resulting in the accumulation of mutations over time 
and potentially allele drop-out in DNA profiles (Hall & Ballantyne, 2004). 


Challenging DNA evidence 


DNA serves various functions in the human body, including coding for proteins. DNA 
also facilitates inheritance and provides a means to explain disparate aspects of life as well 
as its various processes (Travers & Muskhelishvili, 2015). In addition, DNA has proven its 
effectiveness as a powerful tool in criminal investigations (Amankwaa & McCartney, 
2021). However, forensic DNA laboratories encounter many challenges concerning 
DNA evidence due to degradation, the presence of PCR inhibitors, low-template 
DNA samples, and mixture profile interpretation (Butler, 2015). Forensic geneticists 
who are frequently confronted with low- or poor-quality samples are constrained to pro- 
vide partial genetic profiles with allele and/or locus drop-out, equivocal results, or, 
worse, no results. Until now, to resolve these issues, the most effective approach has 
been to increase PCR efficiency by generating shorter amplicons (Constantinescu 
et al., 2012). 

Severe degradation of biological evidence significantly reduces the quality of the DNA 
polymer, making effective testing of the sample difficult, especially with the standard 
nuclear DNA testing protocols, which appear ineffective with severely degraded samples 
and necessitate further methods (Burger et al., 1999). 

When it comes to DNA extraction from biological evidence, ancient DNA (aDNA) 
also presents many challenges, necessitating the use of advanced methods. Contamination, 
which introduces exogenous genetic material, is one of the issues that scientists face 
with this type of DNA (Llamas et al., 2017). The challenges arise because the DNA 
available in a fossil is generally scarce and fragmented, and may be accompanied by DNA 
from environmental microorganisms as well as from the people who handled the fossil 
(Slatkin & Racimo, 2016); any action that contaminates the aDNA aggravates the 
challenge of obtaining reliable evidence (Machado & Granja, 2020, pp. 45-56). 

Low-template DNA, or low copy number DNA, is another challenge that forensic 
investigators face during the analysis of DNA evidence, because the collected samples 
are often mere traces containing small amounts of DNA (Tan, 2010, pp. 1-45). Traditionally, 
the extraction of DNA from this type of evidence is not successful, and the resulting 
genetic profiles are sometimes uninformative, making them inadequate for evaluation and, 
in most cases, leading to the nullification of the evidence and therefore obstructing 
the investigation (Mateen et al., 2020). 

Another major challenge is the interpretation of DNA mixture profiles containing 
contributions from multiple donors due to the low sensitivity of STR loci to allow 
the detection of the minor donor in the presence of a major donor, not only because 
of the potential number of alleles shown in the profile but also because of the low- 
level profiles with common allele drop-out/drop-in and heterozygous imbalance (Gill 
et al., 2015). 
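
The role of allele counts in mixture interpretation can be sketched with a simple lower-bound calculation: since each person carries at most two alleles at an autosomal locus, the locus with the most distinct alleles sets a minimum number of contributors. The profile below is hypothetical, and the function name is an assumption made for illustration; because of allele sharing and drop-out, the true number of contributors may be higher than this bound.

```python
import math

def min_contributors(profile):
    """Lower bound on contributors from the maximum allele count at any autosomal locus."""
    max_alleles = max(len(set(alleles)) for alleles in profile.values())
    return math.ceil(max_alleles / 2)

# Hypothetical mixed profile: locus name -> observed alleles
profile = {
    "D21S11": [28, 29, 30, 31.2, 32.2],
    "FGA": [20, 22, 24],
    "TH01": [6, 9.3],
}
print(min_contributors(profile))
```

Here five distinct alleles at one locus imply at least three donors, even though the other loci alone would be compatible with fewer.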


Applications of NGS in analysis of challenging samples 


The existence of additional donors in the major or minor DNA fraction increases 
the complexity of the mixtures, which has a significant impact on the downstream 
interpretation of STR profiles. Recent advances in STR profiling techniques have facilitated 
the investigation and recovery of DNA mixtures, not only from samples where mixtures 
might be expected (e.g., sexual offense samples), but also from low-quality/quantity 
samples recovered from handled items that often produce complex mixtures with large 
numbers of contributors (Kelly et al., 2012; Cavanaugh & Bathrick, 2018). Because of 
this increasing complexity, the field has shifted significantly from relatively simple 
genotyping methods, which determine whether an individual could be excluded as a 
potential contributor to a mixture, to likelihood ratio tests, which facilitate the 
estimation of the most likely genotype combinations of contributors to a 
mixture (Kelly et al., 2014; Bieber et al., 2016). 

All these challenges have led to the development of new methods, especially DNA 
extraction and PCR approaches and kits, as well as interpretation methods that use 
probabilistic frameworks incorporating probabilities of allele drop-out and drop-in, modeled 
from validation and empirical data (Bright et al., 2017; Coble & Bright, 2019). NGS is a 
well-established approach that could help overcome some of the STR mixture 
constraints and improve the analysis of challenging mixture samples (Oldoni 
& Podini, 2019). 
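
How drop-out and drop-in probabilities enter a likelihood ratio can be sketched for a single locus and a single contributor. The model below is a deliberately simplified illustration and not the implementation of any cited software: the allele frequencies, the drop-out probability d, the drop-in probability c, and the choice of d squared for homozygote drop-out are all assumptions made for this example.

```python
from itertools import combinations_with_replacement

def genotype_prob(genotype, freqs):
    """Hardy-Weinberg genotype probability."""
    a, b = genotype
    return freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b]

def evidence_prob(evidence, genotype, freqs, d, c):
    """P(observed alleles | donor genotype) with drop-out d and drop-in c."""
    a, b = genotype
    p = 1.0
    if a == b:
        # Assumption: a homozygous allele is lost only if both copies drop out
        p *= (1 - d ** 2) if a in evidence else d ** 2
    else:
        for allele in (a, b):
            p *= (1 - d) if allele in evidence else d
    extra = [x for x in sorted(evidence) if x not in genotype]
    if extra:
        for x in extra:
            p *= c * freqs[x]   # each unexplained allele is a drop-in event
    else:
        p *= 1 - c              # no drop-in occurred
    return p

def likelihood_ratio(evidence, suspect, freqs, d, c):
    """LR = P(E | suspect is the donor) / P(E | an unknown person is the donor)."""
    evidence = set(evidence)
    numerator = evidence_prob(evidence, suspect, freqs, d, c)
    denominator = sum(
        genotype_prob(g, freqs) * evidence_prob(evidence, g, freqs, d, c)
        for g in combinations_with_replacement(sorted(freqs), 2)
    )
    return numerator / denominator

# Hypothetical allele frequencies at one locus (they sum to 1)
freqs = {"12": 0.1, "13": 0.2, "14": 0.7}
print(likelihood_ratio({"12", "13"}, ("12", "13"), freqs, d=0.0, c=0.0))
print(likelihood_ratio({"12"}, ("12", "13"), freqs, d=0.2, c=0.05))
```

With no drop-out or drop-in, the sketch reduces to the familiar reciprocal of the genotype frequency; with nonzero d and c, a partial profile still yields a finite LR instead of an exclusion.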


Previous types of technologies used to analyze challenging DNA 


CE remains the best-known tool in forensic laboratories for separating DNA fragments 
by amplicon length. However, as mentioned previously, small amounts of 
DNA are a serious challenge that forensic scientists grapple with when collecting 
evidence from crime or accident scenes. Degraded human remains also yield low-template 
DNA, necessitating alternative methods for DNA analysis. 
Several methods have been proposed for assessing challenging forensic DNA samples, 
as described below. 


MiniSTRs 


One of the previously used approaches to analyzing challenging DNA samples is min- 
iSTR analysis. This approach resolves the issues of DNA quality and quantity that are 
often a problem when handling degraded samples by employing reduced-size STR 
amplicons for the analysis of low-template DNA (Butler & Willis, 2020). Through min- 
iSTRs, scientists can recover the needed information from the small samples in their 
possession (Mautner et al., 2017). 

MiniSTR analysis uses PCR to amplify DNA regions that are 
highly polymorphic and repetitive (Jordan & Mills, 2021). By designing primer binding 
sites closer to the target repeat region, which reduces the amplicon size to 80-150 bp, 
miniSTR analysis can be applied to degraded DNA samples (Butler et al., 2003; Dixon 
et al., 2006; Opel et al., 2007). 
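
Why shorter amplicons help with degraded DNA can be illustrated with a toy fragmentation model. The model is an assumption introduced here, not taken from the cited studies: each base position is taken to break independently with a fixed probability, so the chance that a template region of a given length is still intact falls off exponentially with amplicon size. The amplicon lengths and break probability are hypothetical.

```python
def intact_fraction(amplicon_bp, break_prob_per_base):
    """Fraction of templates with no break inside the amplicon (independent-break assumption)."""
    return (1.0 - break_prob_per_base) ** amplicon_bp

# Hypothetical degradation level: on average one break per 100 bases
break_prob = 0.01
for label, size in [("miniSTR amplicon", 120), ("conventional STR amplicon", 350)]:
    print(f"{label} ({size} bp): {intact_fraction(size, break_prob):.3f}")
```

Under these assumptions roughly 30% of the 120 bp targets remain amplifiable, compared with about 3% of the 350 bp targets, which is consistent with the rationale for moving primers closer to the repeat region.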

However, miniSTRs have certain drawbacks. Although bringing the primers closer 
to the repeat region reduces overall amplicon size, loci with the widest range of 
alleles, such as D21S11 and FGA, still show large allele size differences, leading 
to a difference in PCR performance between the extreme 
repeat numbers (Gill, 2001). 


Mitochondrial DNA typing 


Because mtDNA is inherited maternally and passes essentially unchanged from mother to 
offspring, it is widely applied in forensic investigations. This type of inheritance 
makes this small genome valuable for determining a person’s ancestry as well as for 
analyzing degraded or low-copy-number DNA (Mishmar et al., 2019). The analysis of mtDNA 
sequence polymorphisms has proved useful when nuclear DNA is too degraded, since 
mtDNA is present in high copy numbers and is therefore more likely to be recovered from 
forensic biological samples. This is evident in the cases of telogen hair roots, hair 
shafts, bones, teeth, and touch DNA (Budowle et al., 1999; Parson et al., 2014). 

The value of mtDNA for DNA evidence analysis has been attributed to its 
resistance to environmental insults. Several investigations have analyzed 
the whole mtDNA genome rather than just the hypervariable polymorphic regions 
HVI/HVII to increase the power of discrimination of mtDNA and to resolve the 
problem of common sequences in the HVI/HVII regions shared by populations 
of the same origin (Templeton et al., 2013; Zavala et al., 2019; Budowle et al., 2004; 
Alvarez-Iglesias et al., 2007). 

Until recently, the only alternative to standard STR genotyping for highly degraded 
DNA was mtDNA sequencing following the standard protocols. 
However, mtDNA sequencing has certain difficulties, mainly 
because it is laborious, time-consuming, and costly for the operational laboratory. 
Therefore, only a portion of forensic DNA laboratories have implemented 
mtDNA analysis (Budowle & Van Daal, 2008). Another disadvantage of using mtDNA 
typing in casework is that the observed power of discrimination is not as high as that 
provided by the STR loci. Matrilineal inheritance also rules out certain 
relationship comparisons, such as father and daughter, and large-scale haplotype 
databases are urgently needed to interpret the significance of mtDNA variation correctly 
(Alvarez-Iglesias et al., 2007). A further problem is that the approach can fail in the 
presence of heteroplasmy, where more than one mtDNA genome type is 
present in an individual (Mechta et al., 2017). 


Single nucleotide polymorphism typing 

SNPs are single-base variations (substitutions, insertions, or deletions) at a unique 
physical location. SNPs are highly abundant, estimated to occur at 1 in every 1000 bases 
in the human genome, and represent the most common class of human polymorphism 
(Sachidanandam et al., 2001; Venter et al., 2001). Consequently, there is an abundance of 
genetic information that can be exploited, since approximately 85% of human variation is 
derived from SNPs (Cooper et al., 1985; Wang et al., 1998; Holden, 2002). 

According to previous reports, SNP markers are valuable for DNA typing of damaged 
samples because they increase the quantity of genetic information obtained from 
challenging forensic samples (Gill, 2001; Budowle et al., 2004; Sobrino et al., 2005). 
SNPs have been reported to perform well on degraded and highly degraded samples and can be 
analyzed with rapid, high-throughput platforms using short amplicons, which favors 
successful amplification in most cases because of the small size of the amplified 
product (Dixon et al., 2006; Butler et al., 2007; Budowle & Van Daal, 2008). In this 
case, the PCR reaction can be optimized to amplify the SNP markers with 
primers designed immediately adjacent to the targeted single-base variations, so that 
only a 60-80 bp fragment needs to be amplified (Budowle, 2004; Budowle et al., 2004; 
Divne & Allen, 2005; Sobrino et al., 2005; Dixon et al., 2006). 

In addition, it would take a substantial reduction in cost, increased 
throughput, and enhanced capabilities to resolve mixed samples for SNP analysis to 
replace the STR loci in the forensic community (Gill et al., 2004; Sobrino et al., 2005; 
Butler et al., 2007; Oldoni & Podini, 2019). Several studies have reported the 
effectiveness of SNP variants, particularly with high-throughput technologies, which have 
become increasingly important for the successful implementation of large criminal DNA 
databases (Divne & Allen, 2005; Borsting & Morling, 2016; Guo et al., 2017; Shih et al., 
2018; Zavala et al., 2019; Haddrill, 2021). 

Indel polymorphisms are among the most suitable markers for DNA testing of 
low-template or degraded DNA, because their small amplicons make them well suited to 
genotyping DNA that has suffered high degradation (Pereira et al., 2009; Zidkova et al., 
2013; Klein et al., 2015). Indels are also considered suitable for paternity cases due 
to their low rate of mutation compared with STRs (Klein et al., 2015). The indel-based 
method involves the identification of specific insertion/deletion mutations that provide 
diallelic markers (Pereira et al., 2009; Klein et al., 2015). One of the best-known 
commercial kits for assessing indel polymorphisms is the Investigator DIPlex kit (Qiagen), 
which enables the forensic community to apply this method in investigations and 
appears to be effective in challenging DNA analysis (Jordan & Mills, 2021; Klein et al., 
2015). The kit covers 30 indel loci distributed across more than 9 autosomal 
chromosomes (Zidkova et al., 2013; Klein et al., 2015). 


Because of their biallelic nature and the well-established STR loci databases, SNPs are 
unlikely to become the primary forensic markers and are considered less informative 
per locus than the forensically selected STR loci. Another major problem with 
SNP-based methods is that a panel of at least 50 to 100 autosomal SNP loci would be 
required to achieve the same power of discrimination that the existing STR multiplex 
systems provide (Chakraborty et al., 1999). 
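
The need for many SNP loci follows from simple arithmetic, sketched below under stated assumptions: Hardy-Weinberg equilibrium, independent loci, the same minor allele frequency at every locus, and a hypothetical target match probability of 1e-15. None of these values is taken from the cited work.

```python
import math

def locus_match_probability(maf):
    """Chance that two unrelated people share a genotype at one biallelic SNP (HWE assumed)."""
    p, q = maf, 1.0 - maf
    return p ** 4 + (2 * p * q) ** 2 + q ** 4

def snps_needed(target, maf):
    """Smallest number of independent SNPs whose combined match probability is <= target."""
    return math.ceil(math.log(target) / math.log(locus_match_probability(maf)))

for maf in (0.5, 0.2):
    print(f"MAF {maf}: per-locus match probability {locus_match_probability(maf):.4f}, "
          f"loci needed for 1e-15: {snps_needed(1e-15, maf)}")
```

Even in the ideal case of a minor allele frequency of 0.5, a single SNP matches by chance 37.5% of the time, so dozens of loci are required; with less balanced allele frequencies the count rises further, in line with the 50 to 100 loci quoted above.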


Application of NGS in challenging DNA evidence 


CE is the primary method used in forensic DNA laboratories around the world for 
separating and detecting STR alleles (Butler, 2012). However, the previously reported 
limitations in assessing challenging DNA samples effectively necessitate the further 
development of advanced tools to overcome those problems by targeting the entire genome. 
In particular, the analysis of standard forensic STR markers using MPS has revealed several 
advantages over conventional CE, including (1) an increased number of loci that can 
be analyzed simultaneously, (2) a higher discrimination power due to increased STR 
allele sequence diversity and the greater number of loci, and (3) shorter amplicons for 
a more effective analysis of degraded and/or low-quantity forensic biological evidence 
(Alonso et al., 2018). 

Given these benefits, NGS is a promising adjunct method for forensic casework, 
although, according to the recommendations of the International Society for Forensic 
Genetics, the use of NGS in casework requires the availability of global datasets 
designed to enrich the allele frequencies of sequence-based STR genotypes (Parson et al., 
2016; Phillips et al., 2018a,b). 


Degraded samples and low-template DNA evidence 


STRs are the markers used for routine human identity testing in the forensic community. 
However, despite their high discrimination and robustness, STR assays frequently fail 
when genotyping extremely degraded samples (Butler et al., 2003; Ballantyne et al., 
2007; Bogas et al., 2009; Dash & Das, 2018; Haddrill, 2021). The NGS approach has 
become one of the more practical ways to address the issues that forensic scientists and 
other DNA laboratory specialists face when handling limited DNA evidence, 
because its discrimination power can be increased through the simultaneous 
analysis of multiple, diverse genetic markers, including not only STRs but also SNPs and 
mtDNA (Zavala et al., 2019). It has been reported that NGS technology has significantly 
improved results for low-level and degraded DNA samples as a result of the shorter 
amplicons compared with standard STR profiling (Haddrill, 2021). Another major 
advantage of NGS technology is its capacity to characterize single molecules, which 
allows it to deliver results with much lower concentrations of input DNA (Prosser et al., 
2016). 


Previously, forensic scientists required considerable DNA input to analyze and 
generate evidence that could be used in a given case. However, contamination and 
degradation often prevented these professionals from obtaining such large amounts, 
reducing the quality of their work. Fortunately, the advent of NGS technology 
has reduced the DNA sample input that forensic scientists require to prepare NGS 
libraries for sequencing and obtain maximum information (Alvarez-Cubero et al., 2017). 

Multiple studies have reported that NGS technologies will be crucial for 
human DNA typing in cases where forensic specimens and samples are compromised 
and degraded. 

Zhang et al. evaluated the performance of Illumina’s MiSeq FGx System in recovering 
forensic genomic sequencing data from degraded and low-template DNA. The analysis 
included a sensitivity study with a range of 1 ng to 5 pg of 2800M DNA. The 2800M DNA 
was artificially and systematically degraded by incubation at 98°C for 10, 20, 30, 40, 
50, 60, 70, 120, and 180 min, resulting in degradation index values ranging from 
0.837 to 232.247 (Zhang et al., 2018). When the DNA input was greater than 50 pg 
or the degradation index was less than 72.28, the detected allele call frequencies were 
greater than 80%. When the samples were heated for more than 120 min or the input 
quantity was less than 25 pg, the allele balance was less than 0.6. The data also showed 
that the stutter type and ratio depend on the specific locus, with the sequencing run being 
the most important factor in noise occurrence. The findings of this study will aid 
laboratories in developing workflows for the analysis of difficult samples using the 
ForenSeq kit and the MiSeq FGx system (Zhang et al., 2018). 
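
An allele balance check of the kind referred to above can be sketched as follows. The definition used here (lower read count divided by higher read count at a heterozygous locus) is a common convention and an assumption on our part, as the exact metric of the cited study is not reproduced here; the read counts and function names are hypothetical, and the 0.6 cut-off simply mirrors the value quoted in the text.

```python
def allele_balance(reads_a, reads_b):
    """Heterozygote balance: lower read count divided by higher read count."""
    high = max(reads_a, reads_b)
    return min(reads_a, reads_b) / high if high > 0 else 0.0

def flag_imbalanced(locus_reads, threshold=0.6):
    """Return the loci whose allele balance falls below the threshold."""
    return sorted(
        locus for locus, (a, b) in locus_reads.items()
        if allele_balance(a, b) < threshold
    )

# Hypothetical read counts for the two alleles of three heterozygous loci
locus_reads = {"D3S1358": (480, 510), "FGA": (150, 420), "TH01": (300, 500)}
print(flag_imbalanced(locus_reads))
```

A laboratory workflow could use such a flag to route imbalanced loci to manual review, with the threshold set from its own validation data.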

On heavily degraded DNA samples, NGS appears to offer very promising results. To 
test the performance and reliability of the ForenSeq DNA Signature Prep kit using the 
MiSeq sequencer (Illumina), a trial set of DNA samples was artificially degraded by 
progressive aqueous hydrolysis and analyzed along with the corresponding unmodified DNA 
sample and control sample (2800M). The kit yielded usable results for 
three STRs (one autosomal and two Y-chromosome STRs) and 10 autosomal SNPs 
even when CE was unable to produce a readable profile for the most degraded trial sample. 
Furthermore, the reciprocal of the random match probability (1/RMP, that is, the 
likelihood ratio or LR) calculated for the autosomal loci was 1.7 × 10^5, which can be 
considered an excellent achievement for such a highly damaged DNA sample (Fattorini et 
al., 2017). 
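
The relationship between the random match probability and the LR for a single-source profile can be made concrete with a short calculation. The allele frequencies below are hypothetical and unrelated to the cited study; Hardy-Weinberg equilibrium and independence between loci (the product rule) are assumed.

```python
import math

def genotype_frequency(p, q=None):
    """Hardy-Weinberg genotype frequency: p**2 for a homozygote, 2pq for a heterozygote."""
    return p * p if q is None else 2 * p * q

def random_match_probability(genotype_freqs):
    """Product rule across loci assumed to be independent."""
    return math.prod(genotype_freqs)

# Hypothetical three-locus profile: heterozygote, homozygote, heterozygote
profile = [
    genotype_frequency(0.1, 0.2),
    genotype_frequency(0.3),
    genotype_frequency(0.25, 0.2),
]
rmp = random_match_probability(profile)
lr = 1 / rmp
print(f"RMP = {rmp:.2e}, LR = {lr:.0f}")
```

Each additional typed locus multiplies the RMP by a further genotype frequency, which is why even a handful of recovered markers from a damaged sample can give an LR in the thousands or higher.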

Elwick et al. sequenced bone and tooth samples exposed to a variety of 
DNA insults (cremation, embalming, decomposition, thermal degradation, and fire) 
using Precision ID chemistry and a custom AmpliSeq STR and iiSNP panel on the Ion S5 
System and the ForenSeq DNA Signature Prep Kit on the MiSeq FGx system, alongside the 
GlobalFiler PCR Amplification Kit for CE-based genotyping (Elwick et al., 2019). The 
results showed that traditional CE-based genotyping performed as expected, producing a 
partial or full DNA profile for all samples, and that both sequencing chemistries and 
platforms recovered sufficient STR and SNP information from the majority of the same 
difficult samples. Despite the degree of damage in some samples, run metrics such as 
profile completeness and mean read depth were good with each system. For most 
insults (except decomposition), the MPS systems produced a similar number of alleles 
(Elwick et al., 2019). 

Zavala et al. conducted a modeling study showing that SNPs can increase the 
significance of identification when analyzing DNA down to an average size of 100 bp for 
input amounts ranging from 0.375 to 1 ng of nuclear DNA. The findings of this study 
were then compared with results from human skeletal material (n = 14, ninth to eighteenth 
centuries), demonstrating the effectiveness of the ForenSeq kit for degraded samples. The 
Promega PowerSeq Mito System was also tested for reliability with human skeletal 
remains (n = 70, ninth to eighteenth centuries), resulting in successful coverage of 
99.29% of the mtDNA control region at 50× coverage or more. This was accompanied 
by changes to a common DNA extraction technique for skeletal remains that improved 
template recovery for shorter templates (Zavala et al., 2019). 
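
A coverage statistic of this kind, the fraction of positions sequenced to at least a given depth, can be computed as in the following sketch. The depth values are hypothetical and the function name is an assumption for illustration; real per-base depths would come from the alignment of the sequencing reads.

```python
def fraction_covered(depths, min_depth=50):
    """Fraction of positions whose read depth is at least min_depth."""
    if not depths:
        return 0.0
    return sum(1 for depth in depths if depth >= min_depth) / len(depths)

# Hypothetical per-base depths across a 1,000-position target region
depths = [120] * 990 + [30] * 10
print(f"{fraction_covered(depths):.2%} of positions at >= 50x")
```

Reporting the covered fraction alongside the depth threshold, as done in the study above, makes results from different samples and runs directly comparable.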

Developing an NGS assay that can capture mitochondrial probes is an essential 
component of this technology because it increases the sensitivity and specificity when 
dealing with a mixture of degraded samples (Calloway et al., 2014). NGS technology 
has also seen the development of software that can facilitate genome sequencing 
(Haddrill, 2021). In this way, NGS technology can overcome the limitations of chal- 
lenging DNA evidence. The Accel NGS 2S family of DNA library kits and the Accel 
NGS 1S Plus DNA Library Kit are among the NGS technologies that can aid the 
sequencing of damaged and degraded DNA (Regnault et al., 2021). 

Elwick et al. compared the baseline tolerance of the two sequencing chemistries and
platforms to inhibitors commonly found in human remains recovered from missing person
cases (Elwick et al., 2018). Both the HID-Ion AmpliSeq Library Kit and ID Panel and
the ForenSeq DNA Signature Prep Kit were susceptible to humic acid, melanin, and
collagen; however, the ForenSeq kit tolerated melanin and collagen more effectively
than the AmpliSeq kit. The ForenSeq kit was also resistant to the
effects of hematin and calcium, whereas the AmpliSeq kit was severely inhibited by
hematin. With the ForenSeq kit, STRs and SNPs showed the same trend among
inhibitors (Elwick et al., 2018).

The ForenSeq DNA Signature Prep kit was additionally used to investigate the effects 
of two well-known PCR inhibitors, humic acid and hematin. Humic acid and hematin 
resulted in decreased read numbers as well as specific negative effects on certain markers. 
Furthermore, it was demonstrated that a common capillary gel electrophoresis-based 
STR kit can handle at least 200 times more inhibitors than the ForenSeq DNA Signature 
Prep kit. This suggests that the PCR components could be improved to ensure analytical 
success for difficult samples, which is required for MPS to be widely used in forensic STR 
analysis (Sidstedt et al., 2019). 


Applications of NGS in analysis of challenging samples 


DNA mixture profile 


Initial MPS analyses of STR mixtures revealed that, when appropriately applied and 
interpreted, NGS could be more powerful, sensitive, informative, and perform better 
in mixture detection and deconvolution than conventional CE fragment analysis 
(Holland et al., 2011; Calloway et al., 2014; Van Der Gaag et al., 2016; Wang et al., 
2017; Kocher et al., 2018; Novroski et al., 2019; Tao et al., 2019; Ragazzo et al., 
2020). The bioinformatic framework My-Forensic-Loci-queries (MyFLq) for the analysis
of MPS forensic data has also been presented. The MyFLq framework was applied to
an Illumina MiSeq dataset of a forensic Illumina amplicon library generated from
multilocus STR PCR on both single-contributor samples and multiple-person DNA mixtures
(Van Neste et al., 2014). Certainly, NGS has the potential to improve mixture
deconvolution even further by allowing the detection of specific intrarepeat variations within
STR allelic variants of the minor contributor(s) in mixture DNA samples. However,
these initial research studies also exposed the limitations of PCR-based STR analysis in
detecting minor alleles due to the amplification of the major allele prior to library preparation
(Gill & Haned, 2013; Tytgat et al., 2020; Van Der Gaag et al., 2016).

This method is particularly useful in studying DNA mixtures for its ability to analyze 
amplicon pools generated through the use of several primer sets on DNA extracts from 
multiple specimens (Prosser et al., 2016) and to differentiate the contributors through iso- 
metric alleles (with the same length but different sequences) and informative variants 
(SNPs) within the flanking regions of the repeat motif (Tao et al., 2019; Guo et al., 
2017). These findings, while preliminary, may help forensic experts in mixture interpre- 
tation, and further investigations on this topic will need to be undertaken (Borsting & 
Morling, 2015). 

A previous study confirmed that NGS methods are able to detect sequences
from the minor contributor in 1:50 and 1:100 mixtures, which is not possible with
CE technology (Ragazzo et al., 2020). In these types of mixtures, it is difficult to separate
the reads of the minor contributor alleles from stutter and noise sequences. The presence
of minor contributors was detected in saliva and urine samples using the Precision ID
GlobalFiler NGS STR Panel v2 at both dilutions, confirming the reliability and robustness
of the NGS method for human identification with optimal or suboptimal amounts of
DNA (Ragazzo et al., 2020). The Precision ID GlobalFiler NGS STR Panel v2 kit
provides greater identification power because 35 markers are analyzed, as opposed to
the 24 markers of the traditional GlobalFiler Amplification Kit.

For instance, Zeng et al. used the PowerSeq Auto System of 22 STR loci, one 
Y-STR, and the Amelogenin locus for sequence-based STR analysis on the MiSeq plat- 
form (Zeng et al., 2015). The research team compared the results obtained for DNA mix- 
tures using the PowerPlex Fusion System (Promega) on a CE platform. Mixtures at 
different ratios were amplified using a 0.5 ng input of DNA along with mock forensic 
casework samples (blood/saliva and saliva/semen mixtures). Overall, the NGS results
indicated that the prototype PowerSeq STR multiplex system could provide results 
that are concordant and comparable in performance with CE fragment analysis for the 
interpretation of two-person DNA mixtures. The NGS PowerSeq Auto System, how- 
ever, enabled the detection of more alleles than CE fragment analysis, with partial profiles 
of minor contributors reported up to a 19:1 ratio (Zeng et al., 2015). 

In 2016, van der Gaag et al. described the effectiveness
of the PowerSeq (Promega) system on the MiSeq sequencer across a series of mixture
ratios (1:99, 5:95, 10:90, 20:80, 50:50, 80:20, 90:10, 95:5, and 99:1) using a total DNA
input varying from 0.5 to 6 ng, with a minimum of 60 pg of the minor donor (Van Der
Gaag et al., 2016). The sequence reads representing minor contributors down to
1% of the total DNA profile could be recovered for only a few STR loci. This was
explained by stutter and by the presence of extra PCR amplification or sequence error
variants at some loci, which could complicate the interpretation of STR profiles without
prior knowledge of the mixture ratio (Van Der Gaag et al., 2016).
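
The relationship between mixture ratio, total input, and the amount of minor-donor DNA can be checked with simple arithmetic: a 1:99 mixture at 6 ng total input, for example, contains 60 pg of the minor donor. The sketch below is illustrative only. The function names are hypothetical, the 1.2 ng and 0.6 ng inputs are example values rather than the study's design, and the figure of roughly 6.6 pg of DNA per diploid human cell is a commonly quoted approximation assumed here, not a value from the study.

```python
PG_PER_DIPLOID_CELL = 6.6  # assumed approximation, not from the cited study


def minor_donor_pg(minor_parts: float, major_parts: float, total_ng: float) -> float:
    """Mass of minor-donor DNA (pg) in a minor:major mixture of total_ng."""
    if minor_parts < 0 or major_parts < 0 or minor_parts + major_parts == 0:
        raise ValueError("ratio parts must be non-negative and not both zero")
    fraction = minor_parts / (minor_parts + major_parts)
    return fraction * total_ng * 1000.0  # 1 ng = 1000 pg


def approx_cell_equivalents(mass_pg: float) -> float:
    """Rough number of diploid cells corresponding to a DNA mass in pg."""
    return mass_pg / PG_PER_DIPLOID_CELL


# Example inputs (hypothetical) that each leave 60 pg of the minor donor.
for minor, major, total_ng in [(1, 99, 6.0), (5, 95, 1.2), (10, 90, 0.6)]:
    pg = minor_donor_pg(minor, major, total_ng)
    print(f"{minor}:{major} at {total_ng} ng -> {pg:.0f} pg minor donor, "
          f"about {approx_cell_equivalents(pg):.0f} cell equivalents")
```

At these amounts the minor donor is represented by only a handful of cell equivalents, which is consistent with the difficulty of recovering its alleles at every locus.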

A year later, Guo et al. tested the performance of the ForenSeq DNA Signature Prep
Kit on the MiSeq FGx Forensic Genomics System (Illumina) in a mixture study. Mixtures
at various ratios (1:49, 1:19, 1:9, 1:1, 9:1, 19:1, and 49:1) were prepared at 1 ng of
input DNA. The ForenSeq kit performed better than conventional CE-STR kits
in generating the full profile of the minor contributor, but mixture deconvolution was
challenging without reference DNA profiles (Guo et al., 2017).

In 2018, Kocher et al. published the results of an EU-funded research project,
DNASeqEx (DNA-STR Massive Sequencing & International Information Exchange),
investigating the ForenSeq DNA Signature Prep Kit on the MiSeq FGx platform in a
series of male:female DNA mixture ratios prepared using 1–2 ng of DNA. The results
of this study indicated that the minor DNA profile was detected down to a ratio of 1:20
in all female-male and male-male pairs, while no allele was detected at more extreme
ratios (1:1000), using either NGS or CE platforms (Kocher et al., 2018). This study
supports evidence from previous observations by Jager et al. (2017).

Oldoni and Podini (2019) demonstrated that the sequence-based STR kit
(ForenSeq DNA Signature Prep Kit) reported more alleles of the minor donor in
extreme mixtures than CE (NGM Select PCR Amplification
Kit and PowerPlex Y23 System). Contrary to expectations, this study found that the
length-based PowerPlex Y23 kit detected most of the male minor
alleles at severely underrepresented male mixture contributions, whereas the comprehensive
MPS multiplex kit (autosomal STRs, Y-STRs, X-STRs, and individual
identification SNPs) was unable to report them. A possible explanation for this might be
primer competition (preferential amplification) between female and male
donors (Oldoni & Podini, 2019).


Wang et al. evaluated the performance of the Precision ID GlobalFiler NGS 32-plex 
STR Panel (Thermo Fisher Scientific) on the Ion S5 XL system to investigate a series of 
mixtures with a total amount of 1 ng. The study confirmed the benefit of mega multiplex 
MPS kits, which include multiple classes of DNA markers to detect minor alleles 
compared to conventional CE-STR typing kits (Wang et al., 2017). In this study, the 
detection of all non-overlapping minor alleles was possible at 1:4 and 1:9 ratios, while 
partial minor STR profiles (8–10 of the 25 non-overlapping alleles) were detected at
an approximately 1:19 ratio. However, no minor alleles were detected at most loci in 
higher-order mixtures (1:29 or 29:1) (Wang et al., 2017). 

In addition, NGS can increase the resolution of mtDNA analysis by detecting low-level
mixtures such as heteroplasmy (<5%), which would not be detectable using standard
analyses of mtDNA polymorphism by Sanger sequencing, with its limit of detection
at ~10%–20% (Bar et al., 2000; Mamanova et al., 2010; Tsiatis et al., 2010; Parson
et al., 2016; Just et al., 2015). Previous research has demonstrated the sensitivity of
NGS in detecting and quantifying heteroplasmic variants and mixture components at
very low levels (1:250), which may allow for mixture deconvolution (Holland et al.,
2011). An NGS study also demonstrated that Y chromosome sequencing could
distinguish between mixed samples of multiple males belonging to the same parent
(Yang et al., 2014).
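
To make the detection-limit comparison concrete, the following sketch computes the minor allele fraction at one mtDNA position from read counts and compares it with a threshold. It is a simplified illustration: the function name, the toy counts, and the use of a plain fraction cutoff (with no model of sequencing error) are assumptions, not a method from the studies cited.

```python
from typing import Dict, Optional, Tuple


def minor_allele_fraction(base_counts: Dict[str, int]) -> Tuple[Optional[str], float]:
    """Return (second most common base, its fraction of all reads)."""
    total = sum(base_counts.values())
    if total == 0:
        raise ValueError("no reads at this position")
    ranked = sorted(base_counts.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2 or ranked[1][1] == 0:
        return None, 0.0  # only one base observed: no minor component
    base, count = ranked[1]
    return base, count / total


# Toy position: 2,000 reads, 60 of which carry the minor base (3%).
base, fraction = minor_allele_fraction({"A": 1940, "G": 60, "C": 0, "T": 0})
below_sanger_limit = fraction < 0.10  # under the ~10%-20% Sanger limit quoted above
```

A 3% component of this kind is countable at high read depth, whereas it would fall below the Sanger detection limit quoted above.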


Conclusion: The main challenges of NGS in forensic applications 


NGS technologies have contributed to MPS, leading to improved evidence generation
for forensic purposes (Ballard et al., 2020). One of the most significant advantages of NGS
is that all (or almost all) of the PCR-CE assays may be integrated into a single NGS assay
if a capture for the appropriate loci can be developed. In cases where additional studies are
required, one NGS assay with several different markers will save time and potentially
reduce the time a sample is examined in the laboratory (Borsting & Morling, 2016).
Implementing NGS in forensic workflows is becoming more feasible, especially
as the technology becomes more affordable and bioinformatics tools are developed
to analyze the large volume of complex data produced (Haddrill, 2021).
However, NGS technology has challenges that extend to forensic science, making
it imperative to address them. Molecular aberrations and the overwhelming number of
(new) variants detected are issues that scientists face when they use these methods
to obtain information (Cottrell, 2018), as are the variable performance of some
markers in terms of coverage and locus imbalance (Ballard et al., 2020) and susceptibility
to PCR inhibitors when compared with standard STR typing kits (Elwick et al., 2018;
Phillips et al., 2018; Sidstedt et al., 2019; Young et al., 2019). In addition, the sophisticated
bioinformatics systems needed can be time-consuming and costly to obtain.

The goal of the analysis of samples in forensic cases is to determine the identity of the
individual(s) who contributed to the sample and possibly estimate any phenotypical
characteristics of the individual(s) or identify the specific tissue type(s) in the sample. This
requires relatively few markers, and a capture-based approach will be much more
economical and require less sample material. However, some NGS systems can only
sequence genomes the size of the human genome, and in normal shotgun sequencing
operations, substantial regions of the genome are only covered by a low number of reads,
which leads to a high risk of misinterpretation of sequencing results, as well as a lack of
reproducibility in concordance studies (Borsting & Morling, 2015). Exome sequencing
may reveal variants in other genes that may be disease-related. A number of ethical
and legislative considerations are raised by introducing NGS in forensic genetics. How
to handle this information is challenging for clinicians and forensic genetics specialists,
too. How much and which loci to sequence, what to report, and whether ignoring
sequence information is prudent are certainly critical questions that require
appropriate answers (Borsting & Morling, 2016).


CHAPTER 21


Detection of human body fluid through
mRNA analysis using NGS


Quentin Gauthier 
Alexandria, VA, United States 


Introduction 


The use of RNA for forensics has seen significant interest in the past two decades 
(Borsting & Morling, 2015). Although a relative newcomer to forensic genetics, RNA 
analysis has the potential for a wide variety of use cases not offered by DNA, such as post- 
mortem interval, time since deposition, chronological age, and body fluid identification 
(Bauer et al., 2003; Borsting & Morling, 2015; Peters et al., 2015; Weinbrecht et al., 
2017). Body fluid identification in particular has captured the attention of forensic scien- 
tists for its ability to contextualize the presence of biological material at crime scenes and 
in evidence. The ability to distinguish saliva from semen in a stain can substantiate alle- 
gations of sexual assault in a way that the presence of DNA simply cannot. Current meth- 
odologies employed by forensic laboratories most commonly include enzymatic assays 
and microscopic analysis (Scientific Working Group on DNA Analysis Methods 
(SWGDAM), 2015). Although these methods are relatively cheap and easy to employ, 
there are a number of downsides associated with them. Some tests, such as the SALIgAE
reagent, require 10 µL of saliva for visualization and therefore aren’t suitable for trace
analysis (Old et al., 2009). The RSID-Semen lateral flow assay is highly specific for
semenogelin; however, it suffers from false negatives in the presence of other body fluids
commonly found in mixtures in forensic contexts (Old et al., 2012). Still other tests,
like the ABAcard HemaTrace, are destructive to the portion of the sample tested, which
is undesirable for any forensic DNA analysis method (Kulstein & Wiegand,
2018).

Other approaches, including DNA methylation, real-time polymerase chain reaction 
(RT-PCR), and high-resolution melt analysis targeting microRNA have seen significant 
attention in the past decade (Antunes et al., 2016; Gauthier et al., 2019; Haas et al., 2009; 
Madi et al., 2012; Park et al., 2014; van den Berge et al., 2014; Zubakov et al., 2010). 
However, next-generation sequencing for messenger RNA (mRNA) has only recently
begun to receive the same attention. Targeting mRNA for body fluid identification
is a logical choice, as it has been reported to be a stable molecule that can be easily
and reliably recovered from evidence submitted for analysis (Setzer et al., 2008). Although
the human genome contains approximately 22,000 coding genes, not all of them are
expressed in every cell type (Salzberg, 2018). The presence of mRNA, therefore, can be
an indicator of cellular function and of the bodily origin of the mRNA (Melé
et al., 2015). Roughly 12% of RNA transcripts are enriched in a single tissue, 10% are
expressed in a subset of tissues at a higher rate than other tissues, and approximately
40% are housekeeping genes that would be expressed in all cells (Karlsson et al., 2021;
Uhlén et al., 2015). Another benefit of RNA analysis is the ability to coextract
DNA and RNA (Bowden et al., 2011). With this approach, it is possible
to generate a body fluid identification from RNA and a full DNA profile from the same
sample of evidence, limiting the possibility of biased sampling (Alvarez et al., 2004). RNA
and DNA coextraction uses both organic- and column-based methods, both of which are
hands-on rather than automated (Parker et al., 2011; Watanabe & Akutsu,
2020). To analyze extracted mRNA, next-generation sequencing could provide a higher
level of specificity, sensitivity, and coverage than many of the previous technologies, such
as RT-PCR followed by capillary electrophoresis or RT-quantitative polymerase chain
reaction (Bustin, 2000; Juusola & Ballantyne, 2007; Lindenbergh et al., 2012). Efforts have
even been made to develop assays that analyze DNA and RNA simultaneously,
a feat made possible by next-generation sequencing (Zubakov et al.,
2015). While the number of studies targeting mRNA for body fluid identification using
next-generation sequencing remains limited, the results have shown great promise for the
field. Recent efforts have focused on expanding the number of known body fluid-specific
mRNA transcripts, creating assays for targeted sequencing, and exploring
newer sequencing technologies.


Body fluid identification 


A multitude of transcripts have been identified and characterized for their ability to iden- 
tify a variety of forensically relevant body fluids. Typically, saliva, semen, blood, and 
vaginal epithelia are the primary body fluids encountered; however, mRNA sequencing 
has found a number of markers for other body fluids, such as sweat, nasal mucosa, men- 
strual blood, and others (Sakurada et al., 2011). Body fluid identification via transcripts 
relies on the relative stability and abundance of certain transcripts in tissues, while transcripts
of low abundance and stability can only be used on very fresh samples (Lin et al.,
2015). The use of next-generation sequencing platforms for mRNA sequencing, while 
relatively new, has already produced a number of promising studies for the identification 
of body fluids. Table 21.1 lists the markers that have been published in recent next- 
generation sequencing studies. 

Table 21.1 Published mRNA body fluid identification markers derived from NGS.

Group                      Sequencing platform
Zubakov et al. (2015)      Ion Torrent
Lin et al. (2015)          HiSeq2500
Hanson et al. (2018)       MiSeq FGx
Ingold et al. (2018)       Ion Torrent
Ingold et al. (2020)       Ion Torrent
Chirnside et al. (2020)    MiSeq FGx
Albani & Fleming (2018)    HiSeq2500

Marker columns: Saliva, Vaginal fluid, Blood, Semen, Menstrual blood, Skin/Sweat,
Nasal mucosa, Spermatozoa, Housekeeping. Entries are listed by column in printed
order, with table cells separated by semicolons.

Vaginal fluid:    CYP2B7P1, MUC4; CYP2B7P1, DKK4, FAM83D, CYP2A6;
                  CYP2B7P1, DKK4, FAM83D, CYP2A6;
                  CYP2B7P1, DKK4, FAM83D, CYP2A6, CYP2A7
Blood:            HBD, SLC4A1
Menstrual blood:  MMP10, MMP11; MMP11; MMP10, LEFTY2, MMP7, MMP11, SFRP4;
                  MMP10, LEFTY2, MMP7, SFRP4; MMP10, LEFTY2, MMP7, MMP11, SFRP4
Skin/Sweat:       LCE1C, CCL27, IL37, SERPINA12, KRT77, COL17A1;
                  LCE1C, CCL27, IL37, SERPINA12, KRT77;
                  LCE1C, CCL27, IL37, SERPINA12, KRT77, COL17A1
Nasal mucosa:     OPRPN, BPIFA1
Housekeeping:     HPRT1, SDHA; UBE2D2, TCEA1, G6PD

In 2015, Zubakov et al. produced the first proof-of-concept study for the targeted
sequencing of DNA and RNA simultaneously (Zubakov et al., 2015). In their work,
the assay targeted 9 autosomal STRs, amelogenin, and 12 mRNAs for body fluid
identification. The mRNA targets were capable of positively identifying blood, menstrual
blood, saliva, semen, vaginal secretions, and skin in single-source samples. Additionally,
in 2015, Lin et al. analyzed degraded RNA samples for their ability to be successfully 
sequenced in NGS applications while providing sufficient coverage for positive identifi- 
cation of body fluids (Lin et al., 2015). In their study, they developed markers targeting 
saliva, blood, menstrual blood, and vaginal fluid. Each body fluid was successfully iden- 
tified even after degradation by aging for 6 weeks, resulting in RNA Integrity Numbers 
falling below eight, the minimum recommended RNA quality for sequencing. In a 
follow-up study in 2016, Lin et al. described the concept of transcript stable regions 
(StaRs) (Lin et al., 2016). These regions of the transcriptome were identified by compar- 
ing the transcriptomic sequencing coverage of fresh and degraded samples. Lin et al. pro- 
posed that future assays for body fluid identification target these StaRs to achieve 
consistent results regardless of the sample quality. This would be achieved by designing 
primers highly specific to a small number of body fluid-specific transcriptomic regions. In 
2018, Hanson et al. published a targeted mRNA multiplex sequencing approach that tar- 
gets 33 different transcripts for the identification of saliva, blood, vaginal fluid, semen, 
menstrual blood, and skin (Hanson et al., 2018). The sensitivity of this assay was shown 
to be as low as 5–10 ng among the body fluids tested, despite the 50 ng recommended
input from the manufacturer of the library preparation kit. Additionally, blinded single 
source and mixture samples were sent to another laboratory for analysis. Of the 16 sam- 
ples, 15 were accurately identified, with the 16th sample having the major component 
accurately identified. Additionally, in 2018, Ingold et al. presented the results of a collab- 
orative experiment using the Hanson 33-plex on the MiSeq FGx platform and an in- 
house 29-plex on the Ion Torrent S5 system (Ingold et al., 2018). In this study, a dozen 
laboratories were invited to test both platforms on a standardized set of samples. Although 
some interlaboratory variability on the read counts was observed, the conclusions of each 
laboratory were quite similar. In this trial, the S5 workflow demonstrated lower reliability 
with the low input samples provided to the various laboratories. In 2019, Hanson et al. 
developed an 11-target assay for mRNA body fluid identification of saliva, blood, and 
semen containing 21 coding region single nucleotide polymorphisms (cSNPs) (Hanson 
et al., 2019). Their assay was capable of successfully identifying the three body fluids tar- 
geted and differentiating saliva, blood, and semen transcripts from different donors. A 
follow-up study in 2020 by Ingold et al. once again invited laboratories to compare 
the Hanson 33-plex assay to an improved 35-plex assay from Ingold et al. that used 
the previous 29 targets as well as 6 new targets and incorporated cSNPs that are located 
inside of mRNA transcripts targeting various body fluids (Ingold et al., 2020). The 
benefit of the 35-plex is that the cSNPs are capable of linking specific individuals to a 
particular body fluid. However, the performance of the cSNPs in this study was unreli- 
able, leading to ambiguous results overall. The authors recommended pursuing addi- 
tional, more stable cSNPs in an even larger assay to improve the ability to identify 
body fluids and associate individuals with a specific body fluid. The overall performance 
of the two assays on two different platforms remained comparable and represented the
largest targeted multiplex assays currently published. 

Other groups have contributed to the field of body fluid identification via mRNA
sequencing by identifying markers for other body fluids that, while less common in fo- 
rensics, are important for understanding the full spectrum of body fluids that could be 
encountered at a crime scene. Chirnside et al. set out to develop markers for the specific 
identification of nasal mucosa as a body fluid distinct from saliva and tongue samples, as 
well as blood, semen, and vaginal fluids (Chirnside et al., 2020). Their initial results 
showed six possible candidates for nasal mucosa identification, but after further specificity 
testing against all body fluids, OPRPN and BPIFA1 were determined to be the only two 
markers that they could identify with sufficient specificity. The justification for identi- 
fying nasal mucosa as a body fluid is to eliminate the reported cross-reactivity of 
mRNA assays for saliva and nasal samples (Roeder & Haas, 2013). In 2018, Albani 
and Fleming identified several new markers using RNA-Seq data from Lin et al.’s 
work from 2015. In this new study, several new markers were identified for blood, men- 
strual blood, semen, and more specifically spermatozoa (Albani & Fleming, 2018). The 
TNP1 transcript was found exclusively in semen samples containing sperm, while 
KLK2, which is specific for semen, was found in all semen samples. This differentiation 
of spermic and aspermic semen can allow for body fluids to be identified in individuals 
regardless of a person’s reproductive ability. Not seen in the literature thus far are markers 
specific for urine, feces, breast milk, or tears. While these body fluids are not the most 
commonly found body fluids in evidence or at crime scenes, the ability to differentiate 
among all sources of genetic material shed from the human body is crucial. 


Sequencing approach 


With the advancements in next-generation sequencing, the analysis of RNA has not 
necessarily been a priority at all times. For this reason, the majority of NGS approaches 
still rely on sequencing complementary DNA (cDNA). The cDNA is synthesized 
through commercial Reverse Transcription (RT) kits that have varying RT enzyme 
properties that can affect the resulting product (Stahlberg et al., 2004). This variation 
likely contributed to the differing assay sensitivities that were seen in the interlaboratory 
studies conducted by Ingold et al. in 2018 and 2020. Ideally, a reference marker would be 
included in all assays tested to monitor the effects of different RT kits, the presence of 
inhibitors, or normalization of data across samples due to a lack of mRNA-specific
quantification methods. Unfortunately, the expression of published reference genes, also
referred to as housekeeping genes, has been shown to vary significantly between and
even within individuals (Moreno et al., 2012). This can cause issues when comparing
the results of different individuals. For this reason, the inclusion of a spike-in of synthetic
RNA developed by the External RNA Controls Consortium is recommended to help normalize
the data (Pine et al., 2016). Unfortunately, this is typically done after extraction and RT
of the sample and is therefore most helpful for gauging the efficacy of the library
preparation process.
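
One simple way to use a spike-in for normalization is to scale each sample's marker read counts by the reads recovered from the spike-in. The sketch below assumes a single pooled spike-in count per sample and a common reference value. The function name, the marker counts, and the scaling scheme are illustrative assumptions rather than a procedure from the cited work.

```python
from typing import Dict


def normalize_to_spike_in(marker_counts: Dict[str, int], spike_in_reads: int,
                          reference_spike_in_reads: int = 1000) -> Dict[str, float]:
    """Scale marker counts so every sample has the same spike-in read count."""
    if spike_in_reads <= 0:
        raise ValueError("spike-in reads must be positive")
    scale = reference_spike_in_reads / spike_in_reads
    return {marker: count * scale for marker, count in marker_counts.items()}


# Two hypothetical samples with the same marker but different sequencing yield.
sample_a = normalize_to_spike_in({"HBD": 400}, spike_in_reads=2000)
sample_b = normalize_to_spike_in({"HBD": 100}, spike_in_reads=500)
# Both normalize to 200.0, so the raw fourfold difference reflects yield only.
```

Because the spike-in is added after extraction and reverse transcription, a correction of this kind accounts for library preparation and sequencing yield but not for losses in the earlier steps.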

Library preparation of cDNA for RNA-seq follows a similar theme among the major
commercial offerings. The cDNA can either be targeted with specific primer-adapter 
complexes or can be nonspecifically ligated with adapter sequences that are recognized 
by the respective sequencing platform. The sequencing platforms most commonly 
seen in the literature are offered by Illumina, as the MiSeq FGx, and ThermoFisher, as 
the Ion Torrent S5. These two instrumentation paths, while distinct, generally follow 
the same concept of sequencing by synthesis (SBS) following clonal amplification of 
captured molecules of cDNA on a flow cell (Kumar et al., 2019). This approach has relatively
low error rates but is limited to short reads (~150 bp). Another commonality of
SBS instruments is the error rate introduced by base misincorporation events. This 
fact, combined with the potential base misincorporation events during reverse transcrip- 
tion, means that RNA-seq data is particularly prone to having errors in the resulting data 
sets, causing downstream data analysis headaches (Le et al., 2013). An alternative to SBS 
exists in the nanopore sequencing approach employed by Oxford Nanopore Technologies.
Briefly, a motor protein pulls a nucleic acid molecule through a pore, and the alterations
in the ionic current through the pore are used to determine the specific nucleotide order
of the molecule. This sequencing method, combined with the newly launched Direct 
RNA Sequencing kit, which directly targets PolyA RNA, could eliminate the error 
introduced by reverse transcription and SBS (Depledge et al., 2019). At this time, how- 
ever, the innate error rate for this technology is significantly higher than SBS, though im- 
provements are being made at a rapid pace (Sahlin et al., 2021). 


Beyond just body fluids 


The sequencing of mRNA offers a wealth of information beyond just body fluid iden- 
tification. And while outside the scope of this chapter, it should be noted that there are 
more applications. As previously discussed, the use of cSNPs in body fluid-specific 
mRNA markers can allow for the body fluids to be matched to individuals (Ingold 
et al., 2020). A set of mRNA markers was identified for correlation with the postmortem 
interval (Zhu et al., 2017). Advancements in this work could allow for the determination 
of time since death for individuals beyond the 27-h constraint of this particular study. The 
time since deposition of biological material can also be predicted for some of the 
commonly encountered body fluids in crime scenes up to 9 months old (Salzmann 
et al., 2019). A panel of 16 RNA markers was found to predict the chronological age 
of an individual to within 4 years in dermal fibroblast cells (Fleischer et al., 2018). While 
variation would be expected in other body fluids, the ability to predict the age of donors
from RNA-seq would be invaluable to investigators. While not mRNA markers, the 
long noncoding RNAs RPS4Y and XIST have been identified as sex-determining
markers for males and females, respectively, and the markers are found in a variety of
commonly encountered body fluids as they are generally considered housekeeping genes 
(Wang et al., 2021). A side product of RNA-seq analysis of body fluids is the inclusion of 
microbial species that have been coextracted. These microbial species have been found to 
be specific for different locations on the human body and are therefore correlated with 
specific body fluid samples (Salzmann et al., 2019; Swanson et al., 2020). 


Analysis methods for mRNA data 


As the adoption of RNA-Seq has increased, so have the methods for data analysis of the 
resulting sequence data. The standard treatment of data includes read trimming, quality 
control, alignment, and counting (Sheng et al., 2016). A large number of bioinformatic 
methods for the interrogation of gene and isoform abundance in datasets have emerged 
but are generally categorized into two groups: transcript-based and union-exon-based 
approaches. Transcript-based approaches attempt to analyze data by matching reads to 
individual isoforms (Li & Dewey, 2011). As some genes can have a large number of iso- 
forms, accurately identifying a transcript read to a specific isoform is challenging. In 
union-exon methods, all exons of a gene are combined into a single union exon, and 
a transcript read is matched if it has sufficient identity with the union exon (Liao et al., 
2014). In this way, reads can be matched to a gene much more easily, but isoform abun- 
dance in a sample is lost (Zhao et al., 2015). These methods allow for intersample differ- 
ential gene expression to be determined for comparison. In Hanson et al.‘s 2018 study, 
they employed the presence of mRNA markers in a sample to identify each of the body 
fluids in the study by agglomerative hierarchical cluster analysis (Hanson et al., 2018). 
This method treats the mRNA markers as a binary event that isn’t prepared to handle 
mixtures of samples and would have difficulties with low-abundance samples missing 
some but not all markers characteristic of body fluid. Dorum et al. (2018), took the Han- 
son study data set and used the actual read counts of each gene to create a model based on 
partial least squares and linear discriminant analysis (Dorum et al., 2018). This approach 
resulted in higher predictive accuracy for the samples as well as the ability to accurately 
identify the different components in a body fluid mixture sample. To bring mRNA body 
fluid identification in line with contemporary DNA analysis, efforts have been made to 
apply likelihood ratios (LRs) to body fluid prediction. Ypma et al. (2021) used in silico mixing of
real body fluid mRNA profiles to test a prediction model based on the ratio of two hypotheses, H1: the
sample contains at least one body fluid of interest, and H2: the sample contains none of the
body fluids of interest. The authors caution, however, that the resulting
LRs will not be on the same scale as LRs generated for DNA profiles, because
the LR for this model is constrained by the number of data points in the H2
category. To improve the results, a greater amount of data needs to be generated and 
compiled into a database. Additionally, efforts have been made to apply different machine 
learning methods to the analysis of RNA-Seq data rather than traditional differential gene 
expression (Wang et al., 2018). This approach uses artificial intelligence and information 
theory to make algorithms that learn from one dataset and then make predictions on new 
datasets. 
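
The union-exon counting idea described above can be sketched in a few lines of Python. This is a minimal illustration only, not the algorithm of any published tool: the function names, the half-open coordinate convention, the `min_overlap` parameter, and the rule of discarding reads that match more than one gene are all assumptions made for the example, and the gene names and coordinates are invented.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) coordinates on one chromosome


def merge_exons(exons: List[Interval]) -> List[Interval]:
    """Collapse the exons of all isoforms of a gene into non-overlapping union exons."""
    merged: List[Interval] = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1]:
            # Overlapping or adjacent exon: extend the previous union exon.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap(read: Interval, union: List[Interval]) -> int:
    """Number of read bases that fall inside the union exons."""
    return sum(max(0, min(read[1], e) - max(read[0], s)) for s, e in union)


def count_reads(
    reads: List[Interval],
    genes: Dict[str, List[Interval]],
    min_overlap: int = 1,  # hypothetical threshold for "sufficient identity"
) -> Dict[str, int]:
    """Assign each read to the single gene whose union exons it overlaps."""
    unions = {gene: merge_exons(exons) for gene, exons in genes.items()}
    counts = {gene: 0 for gene in genes}
    for read in reads:
        hits = [g for g, u in unions.items() if overlap(read, u) >= min_overlap]
        if len(hits) == 1:  # unassigned or ambiguous reads are skipped
            counts[hits[0]] += 1
    return counts


if __name__ == "__main__":
    # Invented example: GENE_A has two overlapping isoform exons plus a third exon.
    genes = {"GENE_A": [(100, 200), (150, 300), (400, 500)], "GENE_B": [(1000, 1100)]}
    reads = [(120, 170), (290, 340), (450, 520), (600, 650), (1050, 1120)]
    print(count_reads(reads, genes))
```

Note that the per-gene counts say nothing about which isoform each read came from, which is the loss of isoform-level information mentioned above.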


Future of mRNA body fluid identification 


The shift from serology testing in forensic laboratories is a necessary change in the world 
of forensic biology. An accurate confirmatory test derived from the same extraction that 
produces DNA profiles will allow examiners to confidently speak as to the origin of bio- 
logical materials in a more precise manner. Next-generation sequencing of mRNA for 
body fluid identification offers exactly that confirmatory test. However, it is not yet
ready for implementation in casework. A number of body fluids still
have no known mRNA markers, and they have not been reported in the literature as having
been tested against the known markers of commonly encountered body fluids. Confidently
identifying one body fluid requires the exclusion of all other body fluids. 
Urine, feces, tears, and breast milk remain the body fluids that, while not the most 
commonly submitted articles for evidence, do not yet have mRNA markers for confir- 
mation. This must be rectified. 

Additionally, the data that has been collected for mRNA sequencing thus far should 
be shared in a publicly accessible database. As there are a multitude of extraction methods, 
reverse transcription enzymes, library preparation processes, sequencing platforms, and 
bioinformatic pipelines, it is expected that there will be some variation in the data that 
is observed for a particular body fluid. A large collaboration and public sharing of data 
will allow researchers to better define the characteristics needed to identify body fluids 
in a manner that will be transparent and reliable. This transparent and reliable method 
of analysis could even come as a recommendation of specific or general guidelines as 
put forward by the regulatory agencies of the respective countries. 

And finally, the most critical step for the introduction of next-generation sequencing 
of mRNA for body fluid identification to a wide audience of criminal investigation laboratories
is testing the science in a court of law. Many laboratories around the world are
hesitant to try new technologies (caution is absolutely warranted) to the extent that
new methods languish in academic settings for years without making it to a courtroom. 
A crime lab needs to take it upon itself to forge a new path for body fluid identification 
and bring mRNA sequencing into the light. 
Background 


The detection of various body fluids such as blood, semen, vaginal fluid, saliva, menstrual 
blood, and urine is a key aspect of the criminal investigation. Identification of body fluid 
type and origin helps in the reconstruction of a crime scene by linking the sample donors 
and real criminal acts (An et al., 2012). Many body fluids are invisible to the naked eye, 
and most of them share a common morphology, which makes it difficult to distinguish 
them (Harbison & Fleming, 2016). Moreover, when two body fluids are mixed at a crime
scene, it becomes extremely difficult for investigators to determine the nature of each
body fluid. In current practice, body fluids and stains are ultimately used for individualization
through DNA profiling. However, a DNA profile generated from the
perpetrator does not reveal the circumstances under which the material was transferred. Correct
identification of body fluids before DNA profiling establishes the biological source of the DNA
profile, which increases its evidentiary value and provides investigative
leads in a case. 

For better identification of body fluids, innovative techniques are being developed 
with little consumption of evidence. In this regard, attenuated total reflection Fourier 
transform infrared (ATR-FTIR) spectroscopy and Raman spectroscopy coupled with 
chemometrics have shown huge potential as nondestructive body fluid identification 
techniques (Elkins, 2011; Takamura et al., 2018; Virkler & Lednev, 2010). The detection 
of tissue-specific expression profiles of microRNAs (miRNAs) using quantitative reverse 
transcription PCR (RT-PCR) has also been used as an alternative molecular body fluid 
identification technique (Zubakov et al., 2010). Besides, different RNA markers and 
tissue-specific differentially methylated regions have been identified to distinguish foren- 
sically relevant body fluids (Lee et al., 2012). For rapid detection of blood, semen, and/or 
saliva at the crime scene, quantum dot molecular beacon-based techniques have also been 
developed to test tissue-specific RNAs (Young et al., 2017). Additionally, many 
biosensor devices have been developed for real-time analyte/body fluid detection using 
innovative optical sensing techniques (Frascione et al., 2013). Other techniques, such as 
immunochromatographic strip tests, have also been developed for onsite body fluid 
detection (Old et al., 2012). 

Identification of body fluid alone does not necessarily meet the purpose of forensic 
analysis. Species determination is also of utmost importance for providing investigative 
leads. In most cases, limited samples are available for forensic analysis. The preliminary 
tests being conducted for body fluid screening are highly destructive, giving rise to 
sample-limiting conditions for further individualization testing. Rapid on-site testing 
of body fluids without waiting for laboratory-generated reports is highly useful for 
forensic investigation purposes. Though significant advancements have taken place in 
body fluid identification research and many products are available for their detection, a 
practical model with a gold-standard application is still lacking (Virkler & Lednev, 
2009). Thus, a body fluid detection technique with no destructiveness, little consump- 
tion of samples, a successful outcome on a variety of substrates, and high sensitivity has 
become the need of the hour. An overview of the technologies used for forensic inves- 
tigation is presented in Fig. 22.1. 

The human body acts as a micro-ecology and harbors indigenous microbiota as com- 
mensals (Tannock & Tannock, 1999). Extensive analysis of human normal microflora re- 
veals the presence of predominant species types in different body niches such as the oral 
cavity, skin, vagina, stomach, large and small intestine, and urinary tract (McFarland, 
2000). In this regard, analysis of microbial patterns shows huge potential as a better alter- 
native for forensically relevant body fluid detection (Hanssen et al., 2017; Oliveira & 
Amorim, 2018). Although much progress has been made in microbial-based human
body fluid identification research to date, the development of such a model
is still at a nascent stage. The current chapter emphasizes the availability of current
conventional techniques for human body fluid detection and the exploitation of other
technological advancements in the field of body fluid identification research, discussing
their respective pros and cons. Different techniques used for the identification of body
fluids have been summarized in Fig. 22.2. 

Figure 22.1 Various aspects of criminal investigations using body fluids found at the crime scene.
Labels in the figure include cause of death, post-mortem time interval, and investigative leads. 

Figure 22.2 Use of various techniques and molecular markers for identification of body fluids from
biological samples. Labels in the figure include the protein markers alpha-1-antitrypsin,
complement C3, hemoglobin subunit beta, hemopexin, and calpastatin, and the mRNA markers
PRM1, PRM2, semenogelin-1, SPTB, PBGD, ALAS2, HBD-1, and MUC4. 


Existing technologies for human body fluid detection 


Many body fluid identification techniques have been developed over the past century 
and are currently in practice in different forensic science laboratories throughout the 
world. These include chemical tests, immunological tests, protein catalytic activity
tests, spectroscopic methods, and microscopy (An et al., 2012), which are
described below. However, most of them are presumptive, and the search for a
gold standard technique for human-specific body fluid identification continues. 


Conventional techniques 

Semen 

The detection of human semen is one of the most important prerequisites in the investigation of sexual
assault cases. The most common techniques currently used for human semen detection
include a visual examination, followed by the use of ultraviolet light to detect fluorescent
properties of semen and chemical analysis for the detection of acid phosphatase, an
important component of seminal fluid (Stefanidou et al., 2010). On visual inspection,
seminal fluid on fabric appears viscous, stiff, and crusty. For further confirmation,
the stain should be observed under UV light between 400 and 700 nm, as semen is fluorescent
due to the presence of conjugated choline and/or flavin proteins (Kobus et al.,
2002). However, the presence of ointments and creams on fabric has been reported to 
give false-positive results using alternate light sources (ALS) (Santucci et al., 1999). In 
this regard, other commercial ALS, such as Bluemaxx BM500 and Polilight, have shown 
better sensitivity toward seminal fluid detection. But none of these light sources predicts 
the human specificity of the seminal fluid. Further, the detection of seminal acid phos- 
phatase is widely used for the recognition of seminal fluid. After hydrolysis of organic 
phosphates by the acid phosphatase enzyme, the hydrolyzed product reacts with a diazonium
salt chromogen to produce a color change. Combinations of reagents
such as alpha-naphthyl phosphate and Brentamine Fast Blue, beta-naphthol
and Fast Garnet B, alpha-naphthol and Fast Red AL, p-nitroaniline, NaNO3, alpha-naphthyl
phosphoric acid, and aqueous magnesium chloride are commonly used 
for this test (Banerjee, 1983; Gaensslen, 1983; Kosa et al., 1987). 

However, none of the previously described tests can be considered confirmatory, as
studies have shown false-positive results with certain plant materials such as strawberries,
broad beans, onions, and plums, as well as with vaginal acid phosphatase and feces (Gaensslen, 1983; Kosa
et al., 1987; Vennemann et al., 2014), and acid phosphatase tends to degrade over time
due to exposure to heat, mold, putrefaction, or chemicals (Takayama et al., 2001). The
most widely accepted confirmatory test for semen detection is the microscopic identification
of sperm cells using stains such as Gram-modified Christmas tree stain, hematoxylin and eosin, Baecchi’s,
Papanicolaou’s, and Wright’s stains (Kondracki et al., 2017). In naturally azoospermic
individuals and vasectomized persons, however, microscopic examination gives a false-negative
result. Thus, other techniques have been developed to detect prostate-specific antigen,
seminal vesicle-specific antigen, semenogelin (Sg), Sg II, 19-OH F1α/F2α prostaglandin,
and monoclonal antibody 1E5 with limited cross-reactivity (Virkler & Lednev, 2009). Each
of these techniques has its own pros and cons. The user has to choose the appropriate methodology
for the detection of semen according to the requirements of the case. 


Blood 

Blood and bloodstains are the most common biological evidence found at a crime scene. 
To detect latent bloodstains at crime scenes and/or on a dark background, UV light as 
ALS can be applied using light source products such as Polilight. A recent study showed 
that UV light at 365 nm wavelength is highly effective in detecting even traces of human 
blood serum (Kearse, 2020). However, a comparative analysis of UVA, UVB, and UVC
showed that UVC is the most damaging to human blood DNA, as many STR loci could
not be amplified after exposure to UVC (Uchigasaki et al., 2018). Another study showed
the presence of bipyrimidine photoproducts, oxidative lesions, and strand breaks, followed
by inhibition of STR profiling, when bloodstains were exposed to UV light at doses of 25
to 1236 J/cm² (Hall et al., 2014). Thus, ALS, particularly shorter-wavelength UV
light, should be used with care when detecting latent bloodstains on any surface. Another 
presumptive test for blood that has been carried out at the crime scene for a very long 
time is the luminol test. The principle behind this test is that hemoglobin and other blood 
derivatives enhance the oxidation of luminol in an alkaline environment. Popular forms 
of luminol testing, such as Grodsky formulations, Weber, Weber II, and Bluestar alternatives,
have also shown detrimental effects on subsequent DNA analysis, besides giving 
false positive and/or negative results in surfaces covered with terracotta, ceramic tiles, 
polyurethane varnishes, or jute and sisal matting (Passi et al., 2012; Quickenden & 
Creamer, 2001). Another presumptive test, the benzidine test, which exploits the 
peroxidase-like activity of the heme group, has become obsolete due to the carcinogenic 
nature of benzidine and the cross-reactivity of this test with chemical oxidants and plant 
peroxidases (Colotelo et al., 2009). Other catalytic tests, such as the phenolphthalein test 
and leucomalachite green test are not as sensitive as the luminol test but can detect blood 
at a dilution of 1:10,000. However, such tests do not give the species origin of 
bloodstains. 

A microscopic examination of red and white blood cells along with fibrin can be 
considered for confirmation of bloodstains. The most popular blood confirmation test 
is the Teichmann crystal test, in which hematin is formed when a dried bloodstain is heated
in the presence of a halide and glacial acetic acid. Measurement of absorbance at 400 nm,
the absorption maximum of hemoglobin, can also confirm the presence of blood. However, interference
from water submersion, sunlight exposure, heating, and rust has been reported in 
this technique (Bremmer et al., 2011). Isozyme analysis to compare lactate dehydroge- 
nase isozyme patterns and enzyme-linked immunosorbent assays are other confirmatory 
tests for the detection of human blood besides identifying different blood groups using 
specific antibodies (Virkler & Lednev, 2009). Besides, the chromatography technique 
for the separation of hemoglobin and its derivatives by the development of hematopor- 
phyrin can also confirm the presence of blood. However, the lengthy development time 
and necessity of presaturation limit the use of this technology in current-day forensic 
practice. 

Further, it becomes immensely difficult to distinguish between peripheral blood and 
menstrual blood. In this regard, a duplex test combining human hemoglobin and D-dimer
detection has recently been developed to detect blood and menstrual fluid (Holtkötter
et al., 2018). Applications of ATR-FTIR spectroscopy and chemometrics have 
also shown a huge potential to distinguish between peripheral and menstrual blood 
(Sharma et al., 2020). 


Vaginal fluid 

Vaginal fluid as evidence plays a major role in sexual assault cases. As the composition and 
pH of vaginal fluid change over different phases of the menstrual cycle, testing specific 
components is highly challenging (Wagner & Levin, 1980). However, many presumptive 
and confirmatory tests have been proposed and are currently in use in forensic practice. 
The most common test involves the detection of glycogenated epithelial cells using
periodic acid-Schiff reagent. However, individuals in the prepubescent or
postmenopausal stages do not have glycogenated cells (Anderson et al., 2012), which
tends to give false-negative results in those cases and calls the reliability of this test into question.
Other techniques to detect vaginal peptidase, esterase, alkaline phosphatase, β-glucuronidase,
DPNH-diaphorase, and estrogen receptors have also been developed
for the identification of vaginal fluid (Divall, 1984; Gaensslen, 1983; Hausmann et al.,
1996). However, most of these techniques have limitations, as false-positive results
have been reported from male prepuce and urethral mucosa samples. In another 
approach, pristine vaginal fluid can be distinguished from a mixture of vaginal and sem- 
inal fluid by analyzing the ratio of lactic acid and citric acid, as vaginal fluid contains large 
quantities of lactic acid whereas seminal fluid contains large quantities of citric acid (Aldunate
et al., 2013). As both of these carboxylic acids are found in trace amounts in other body fluids
such as saliva and urine, such preliminary tests need to be confirmed by
molecular-level examination. 


Saliva 

Besides semen and blood, saliva is also commonly found at crime scenes and needs to
be detected by forensic examination. Although many presumptive tests have been reported
for this body fluid, no confirmatory test has been proposed to date for saliva detection.
ALS is widely used to locate saliva stains, as they appear bluish-white under UV
(Fiedler et al., 2008). Although detection of amylase activity is widely used as a presumptive
test for saliva because of the higher amount of AMY1 in saliva, cross-reactivity
of this test has been detected with other body fluids such as pancreatic secretions,
semen, and vaginal fluid. Such tests also generate false-positive results for hand cream, 
face lotion, washing powders, urine, and feces (Virkler & Lednev, 2009). Microscopic 
examination such as scanning electron microscopy (SEM) coupled with energy dispersive 
X-ray (EDX) provides tentative identification of saliva by detecting the relative concen- 
tration of sodium, phosphorus, sulfur, chlorine, potassium, calcium, and other trace 
metals, with potassium giving the largest peak (Saibaba et al., 2017).
However, the use of such a sophisticated instrument for preliminary screening of body fluids can
be a huge burden on the laboratory’s resources. 


Urine 

In cases of sexual assault, harassment, and mischief, the examination of urine as a body
fluid can provide an important investigative lead. The characteristic odor of urine may be noticeable over the whole
object, but it is extremely difficult to localize the stain, as urine spreads out
quickly and is diluted on any object. Ionic characterization of urine shows
high levels of phosphate and sulfate, suggesting their usefulness in body fluid detection.
Nessler’s reagent or p-dimethylaminocinnamaldehyde has also been used successfully as an indicator
for the indirect detection of urea, which is abundant in human urine
(Ong et al., 2012). In some cases, creatinine and Tamm-Horsfall glycoprotein 
(THP) are also estimated for urine detection; however, such tests can also give false- 
positive results for urine samples of other species. SEM-EDX, high-performance liquid 
chromatography, and electrospray ionization liquid chromatography-mass spectrometry 
(LC-MS) are widely used to confirm the detected human urine samples. 


Sweat 

Though sweat is rarely encountered as evidence, it shows considerable potential as a source of
DNA for individualization. The composition of sweat is
nearly the same as that of urine, except that it contains only small amounts of urea and creatinine. Identification of 
relative concentrations of sodium, phosphorus, sulfur, chlorine, potassium, calcium, and 
other metal ions using SEM-EDX can indicate the presence of sweat in a sample. Analysis 
of such samples shows a characteristically huge peak of chlorine in comparison to other 
ionic peaks, suggesting the sweat origin of this sample (Seta, 1977). However, this test is 
not confirmatory and does not give any clue about the species origin of the sample. 
Multidimensional spectroscopic analysis by Raman spectroscopy shows spectral compo- 
nents from lactate, lactic acid, urea, and single amino acids, suggesting the usefulness of 
this technique in sweat detection (Sikirzhytski et al., 2012). 


Molecular techniques 

Semen 

The detection of cell-specific gene expression has shown huge potential for confirmatory 
identification of human seminal fluid. Protamines, exclusively expressed in the haploid 
genome during spermatogenesis, have been explored for the detection of dried semen 
stains (Bauer & Patzelt, 2003). Analysis of three semen-specific mRNA markers, i.e., 
Protamine-1 (PRM1), protamine-2 (PRM2), and semenogelin-1 (SEMG1), in 50 sam- 
ples showed their exclusive detection in semen samples with no cross-reactivity with the 
blood sample. Further, this study demonstrated the PRM1 marker to be the most reliable 
marker, followed by PRM2 and SEMG1, for human seminal fluid detection (Nader 
et al., 2015). In this regard, a multiplex mRNA profiling system has been proposed 
that includes semen-specific markers such as PRM2 (protamine 2), TGM4 (transglutami- 
nase 4), and SEMG1, besides simultaneous detection of other human body fluids (Song 
et al., 2015). In a recent study, relative quantification of miRNAs such as miR-144-3p, 
miR-451a-5p, miR-888-5p, miR-891a-5p, miR-203a-3p, miR-223-3p, and miR- 
1260b was useful in discriminating four different types of body fluids (Fujimoto et al., 
2019). Circular RNAs, the products of backsplicing of pre-mRNAs, have also shown 
huge promise in human seminal fluid detection with markers such as TGM4, KLK3, 
and PRM2 (Liu et al., 2019). Analysis of methylome data on 16 samples showed 
cg23521140 and cg17610929 CpG sites to be novel DNA methylation markers for 
semen samples, suggesting these markers to be more effective in human seminal fluid 
detection (Park et al., 2014). In cold cases with limited availability of RNA and the non- 
suitability of DNA methylation analysis, the analysis of copy number variation (CNV) of 
certain tissue-specific DNA markers has also been studied for human body fluid detec- 
tion. Analysis of CNV of mtDNA and telomerase repeat length has been reported to 
be most suitable for human seminal fluid detection through qRT-PCR (Zubakov 
et al., 2018). Genetic markers have shown huge potential in differentiating human seminal
fluid from other body fluids; however, the examiner or analyst should choose the
technique based on their expertise and available infrastructure. 


Blood 

Human immunoglobulin G (IgG) is the most common immunoglobulin in human 
blood and is present continually (Gonzalez-Quintela et al., 2008). Although the structure of IgG
is similar across species, the presence of certain species-specific
epitopes, mostly involving the glycosylation of IgG (Shantha Raju et al., 2000), makes
IgG a suitable target for the detection of human blood in forensic applications.
Besides immunoglobulins, a variety of serum proteins, such as human HbA0, HLA antigens,
and muscle-specific beta-enolase, have been explored for human-specific detection
of blood (Hurley et al., 2009). 

Specific mRNA molecules can also be targeted in the evidentiary samples for specific 
body fluid detection, depending on the cell type and its function. The β-spectrin (SPTB) and
porphobilinogen deaminase (PBGD) genes and their respective mRNAs have been established as
specific targets for human blood (Juusola & Ballantyne, 2005). In another study, delta-aminolevulinate
synthase 2 (ALAS2) was reported as a suitable mRNA marker for the detection
of peripheral blood (Richard et al., 2012). In this regard, the matrix
metalloproteinase-11 (MMP-11) gene, which codes for a protein involved in extracellular
matrix breakdown, is highly useful in detecting menstrual blood. The MMP-11 gene is
found to be upregulated considerably at the time of menstruation in comparison to
the other phases of the menstrual cycle; hence, targeting its mRNA can conclusively
distinguish between peripheral blood and menstrual blood (Counsil & McKillip,
2010). 

Protein biomarkers are one of the most promising targets for human body fluid 
detection. The advantage of protein biomarkers over other techniques is their diversity, 
stability over time, and tissue-specific posttranslational modifications. In this regard, 
alpha-1-antitrypsin, complement C3, hemoglobin subunit beta, and hemopexin have 
been explored for their suitability in human peripheral blood detection and calpastatin 
for human menstrual blood detection at an efficacy level of 96% (Legg et al., 2014). The
4% to 12% cross-reactivity of these markers has been attributed to blood from flossing
teeth, urinary tract infections, the presence of minor vaginal abrasions, or residual menstrual
fluid in the vaginal canal. In another study, the protein biomarkers glycophorin A and
hemoglobin B were found useful in detecting human blood by dot blot assay in
aged and simulated forensic evidence with high specificity and sensitivity (de Beijer
et al., 2018). 


Vaginal fluid 
For a forensic biological examiner, it becomes immensely difficult to distinguish vaginal 
fluid from other similar-looking body fluids such as nasal secretions, saliva, urine, semen, 
and sweat. Many biomarkers have been studied for the proper detection and identifica- 
tion of vaginal fluid as forensic evidence. Human small proline-rich protein 3 (SPRR3) 
isoforms and human fatty acid-binding protein 5 (FABP5) with an acetylated N-terminal 
region lacking the initiator methionine residue have been explored for high-fidelity 
detection of human vaginal fluid (Igoh et al., 2015). Multiplex analysis targeting the
ESR1, SERPINB13, KLK13, CYP2B7P1, and mucin 4 (MUC4) genes has been found
useful for the identification of vaginal fluid (Akutsu et al., 2017, 2020). In other studies,
17β-estradiol (E2-17β), SPRR3, and FABP5 have been targeted to determine human
vaginal fluid by GC-MS and LC-MS (Igoh et al., 2015; Sakurada et al., 2008).
RT-PCR-based detection of human beta-defensin 1 (HBD-1) and MUC4 mRNA markers
has been widely accepted for the identification of vaginal fluid (Cossu et al., 2009a). In addition,
E2-17β, which has the strongest estrogenic activity of all the female hormones, has
been detected by GC-MS, confirming the biological fluid as vaginal secretion. ELISA-based
tests utilizing E2-17β monoclonal antibodies have also shown usefulness for this
hormone-based detection of human vaginal fluid (Virkler & Lednev, 2009). 


Saliva 

Saliva, the complex biological fluid secreted by acinar cells of the major and minor salivary
glands, plays an important forensic role. Besides individualization, saliva samples
can also be used for other forensic applications, such as an indicator of salivary
gland conditions, toxicology, and drug monitoring (Chatterjee, 2019). As amylase is
one of the major constituents of saliva, most techniques rely on the detection of
α-amylase to identify this body fluid. However, most of these tests cannot differentiate
between α-amylase 1, present in saliva, and α-amylase 2, present in semen and vaginal secretions.
With the advent of new technology, many molecular markers and techniques have
been developed to detect human saliva as a body fluid of forensic
importance. 

Many miRNA markers, such as miR-26a, miR-96, miR-135b, miR-138, miR-141, 
miR-145, and miR-182, have been explored for exclusive detection of human saliva in 
forensic samples (Mishra et al., 2021). In this regard, Silva et al. (2016) suggested BCAS4 
as an important epigenetic marker for human saliva detection. Salivary RNA is
regarded as one of the most promising biomarkers for the detection of human saliva.
Although the detection of salivary mRNA is an emerging field, both RT-PCR
and gel electrophoresis have been used for the detection of human
saliva, targeting statherin and histatin 3 as indicators (Juusola & Ballantyne, 2005).
Most salivary protein markers are used for the detection of diseases. However,
they could also be used to detect this body fluid in a forensic context. The most
commonly used protein biomarkers for the detection of saliva include glyceraldehyde-3-phosphate
dehydrogenase, beta-actin, and ribosomal protein S9 (Lee et al., 2011).
Several other protein markers, such as alpha-amylase 1, mucin-5B, and proline-rich protein
HaeIII subfamily 2, have also been explored for the determination of saliva and saliva
stains (de Beijer et al., 2018). Such molecular markers for the identification of human-specific
saliva are still being evaluated for their suitability in forensic detection. 


Urine 

Urine is one of the biological fluids frequently encountered at crime scenes. Besides
the purpose of identification, urine is considered to be more useful than blood or hair for
toxicological and drug detection (Usman et al., 2019). A few markers, such
as uropepsinogen, DNase I, and DNase II, show greater genetic polymorphism in human urine
than in blood serum. Considering the high content of these compounds
in urine, they may also be suitable markers for detecting this body fluid 
(Yasuda et al., 1997). A study on nuclear DNA and mitochondrial DNA from urine sam- 
ples showed a higher recovery of mitochondrial DNA even after 4 months of storage 
(Castella et al., 2006). Due to the superior usability of urine samples, detection of the 
origin of this body fluid becomes imperative from a forensic investigation point of 
view. In this regard, two RT-PCR-based genetic markers, i.e., renin and uroplakin 2, 
have been recommended as the two most suitable markers for human urine detection 
in addition to their huge diagnostic value (Matuszewski et al., 2018). THP, a constituent
of human as well as other animals’ urine, has also been recommended as a suitable
marker for urine detection (Harbison & Fleming, 2016). Several protein markers, such as
osteopontin and uromodulin, have also been proposed as useful for the detection of human
urine using ELISA and immunochromatographic tests (Legg et al., 2014). Other
markers used for the diagnosis of prostate cancer, such as the transmembrane protease
serine-2 and ETS-related gene fusion, could also be used for the detection of urine
(Dimitriadis et al., 2013). 


Sweat 

Dermcidin (DCD), a small antibiotic peptide secreted into human sweat, is considered a
suitable marker for the detection of sweat. Sakurada et al. (2010) evaluated the performance
of both RT-PCR and ELISA assays for the detection of DCD and found that
RT-PCR could detect DCD in 7-day-old stains containing as little as
10 µL of sweat. Besides, mucin-5B and cathepsin D have also shown potential as
useful markers for the detection of human sweat samples (de Beijer et al., 2018). Cystic
fibrosis transmembrane conductance regulator (CFTR) proteins present in sweat glands
are responsible for the transport of Na+ and Cl− in secretory epithelial cells. Modifications
in this protein underlie cystic fibrosis (Jadoon et al.,
2015). Thus, besides its use in disease detection, the CFTR marker may
be a useful marker for the detection of human sweat. Another sweat-specific protein 
marker, G-81, produced by the monoclonal antibody technique, has been found useful 
to specifically detect human sweat samples (Sagawa et al., 2003). 


Rapid techniques 


With advances in technology, many rapid tests for body fluid detection have been
developed over the years. In addition to the screening tests, some noncatalytic
presumptive tests have been developed for different body fluids. For example, Heme Select,
ABAcard, HemaTrace, and Hexagon OBTI are used to identify primate blood. Of these rapid
techniques, Hexagon OBTI has been found to perform well on aged and degraded blood
samples (Spalding et al., 2003). A lateral flow strip containing antiglycophorin A has
recently been used for the rapid detection of human blood by identifying sialoglycoprotein
(Reich, 2008). Similarly, a commercially available rapid stain identification semen test
targets semenogelin (Sg), a semen protein, using antihuman Sg antibodies. This
immunochromatographic membrane assay has been found to be more sensitive, detecting
semen at 1:100,000 dilutions (Pang & Cheung, 2007). Nanotrap Sg was also recently
developed as a rapid test kit for the detection of human semen stains with the same
efficiency (Laffan et al., 2011; Sato et al., 2007).

Many rapid tests have also been developed for the detection of human saliva. A lateral
flow strip method using nine antibodies against human salivary amylase has been developed
for this purpose. This technique has been found useful on a range of challenging samples,
such as buccal swabs, plastic bottles, plastic mugs, ceramic mugs, cigarette butts, and
soda cans (Reich, 2007). Another method, SALIgAE, is available in the form of a spray for
colorimetric detection of saliva. Though this technique is very useful in field
conditions, it was found to be less sensitive than other available rapid tests for human
saliva (Li, 2008). Similarly, rapid tests from Galantos Genetics are available for the
detection of human urine in a variety of samples. Rapid tests for other body fluids, such
as human-specific vaginal fluid and sweat, still need to be developed to complete the
array of rapid techniques for human-specific body fluid detection.


Microbial-based body fluid detection 

A human body contains around 10^14 microbial cells. Human body fluids were earlier
considered to be sterile, but studies have since shown the presence of different
microorganisms in different body fluids. Many commensal and pathogenic microbes
commonly inhabit specific body parts, contributing to their presence in different body
fluids. A list of signature microflora found in different body fluids is given in Table 22.1.
The advantage of microbes is their abundance and stability. As different microorganisms
are found in different body parts, a subset of genetic markers targeting the dominant
microflora can be used to determine the body fluid.

With the advent of molecular biology techniques, the analysis of microflora DNA 
from different fluids has shown promising results in the detection of these body fluids 
(Giampaoli et al., 2012). A recent study showed the use of 16S rRNA and rpoB as target 
regions not only to detect saliva as the body fluid but also to distinguish between the 
saliva of two different individuals (Leake et al., 2016). Another study on 140 Korean in- 
dividuals showed the suitability of the OB mRT-PCR technique targeting the combi- 
natorial detection of three oral bacteria, Streptococcus salivarius, Streptococcus sanguinis, and 
Neisseria subflava, for the identification of human saliva (Jung et al., 2018). Similarly, anal- 
ysis of the 16S rRNA gene of Lactobacillus crispatus, Lactobacillus jensenii, and Atopobium 
vaginae was found to be useful for the detection of human vaginal fluid (Akutsu et al., 
2012). In this context, a microbial-based RT-PCR kit (ForFLUID kit) has been devel- 
oped to discriminate among human vaginal, oral, and fecal samples (Giampaoli et al., 
2014). 

A taxonomy-independent deep neural network approach based on massively parallel
microbiome sequencing of the 16S rRNA gene differentiated human blood from four
different body sites: menstrual, nasal, fingerprick, and venous blood. The accuracy of
this technique (0.992 for menstrual blood, 0.978 for nasal blood, 0.978 for fingerprick
blood, and 0.990 for venous blood) showed the usefulness of microbial analysis for the
detection of blood as well as its origin to the part of the body (Lopez et al., 2020).
However, contradictory findings were reported by Dobay et al. (2019), as vaginal and
menstrual blood share common microbial signatures, making them indistinguishable using
the microbial approach. Hence, though microbial markers are a promising avenue for
body-fluid detection and tissue differentiation, thorough study is required before this
technique is applied to real forensic casework.

Table 22.1 Signature microflora found on different tissues/body fluids for use in body fluid identification.

Sl. no. | Body fluids/tissues | Signature microflora | Genetic markers for identification
1 | Vaginal fluid | Veillonella atypica, Lactobacillus crispatus, Lactobacillus gasseri, Atopobium vaginae | Multiplex PCR of 16S rRNA gene, 16S-23S rDNA
2 | Bloodstains | Proteobacteria, Actinobacteria, Firmicutes, Bacteroidetes | 16S rRNA gene sequencing
3 | Saliva | Streptococcus salivarius, Streptococcus sanguinis, and Neisseria subflava | 16S rRNA and rpoB PCR, 16S-23S ITS, streptococcal-specific PCR-DGGE
4 | Semen stains | Staphylococcus sp., Corynebacterium sp. | 16S rRNA gene sequencing
5 | Menstrual blood | Lactobacillus crispatus, Lactobacillus gasseri | Multiplex PCR of 16S rRNA gene
6 | Sweat | Corynebacterium sp., Staphylococcus sp., Propionibacterium sp. | 16S rRNA gene sequencing
7 | | Firmicutes sp., Bacteroidetes sp., Actinobacteria sp., Fusobacteria sp., Proteobacteria sp. | Culture based approach
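
The idea of matching detected taxa against signature microflora can be illustrated with a minimal sketch. It assumes a sample has already been reduced to a set of detected species names. The signature sets are taken from the species-level rows of Table 22.1; the function name and the simple overlap score are hypothetical choices made for illustration and do not represent a validated forensic method.

```python
# Signature species per body fluid, taken from the species-level rows of Table 22.1.
SIGNATURES = {
    "vaginal fluid": {
        "Veillonella atypica",
        "Lactobacillus crispatus",
        "Lactobacillus gasseri",
        "Atopobium vaginae",
    },
    "saliva": {
        "Streptococcus salivarius",
        "Streptococcus sanguinis",
        "Neisseria subflava",
    },
    "menstrual blood": {
        "Lactobacillus crispatus",
        "Lactobacillus gasseri",
    },
}


def rank_body_fluids(detected):
    """Rank body fluids by the fraction of their signature species detected.

    The score is a hypothetical illustration: shared signature species
    divided by the size of the signature set.
    """
    detected = set(detected)
    scores = {
        fluid: len(signature & detected) / len(signature)
        for fluid, signature in SIGNATURES.items()
    }
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    sample = {"Lactobacillus crispatus", "Lactobacillus gasseri"}
    for fluid, score in rank_body_fluids(sample):
        print(f"{fluid}: {score:.2f}")
```

In this toy example, a sample containing only the two Lactobacillus species scores above zero for both menstrual blood and vaginal fluid, which reflects the shared signatures that make these two sources hard to separate.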
Technological advancements in microbial forensics 


Technological advancements in the microbial analysis of biological samples have reached
the genomic and postgenomic eras. Most studies focus on sequencing the bacterial 16S
rRNA gene. Studies analyzing the microbiome of body fluids and tissue samples such as
saliva, skin, peripheral blood, menstrual blood, feces, and semen have used this technique
to achieve sufficient discrimination power to distinguish among tissue types (Sijen &
Harbison, 2021). Many studies have advocated a whole-genome sequencing (WGS) approach
to obtain highly accurate data for microbial body fluid detection; however, the
per-sample analysis cost is a major drawback of this approach (Ranjan et al., 2016).

Besides sequencing of the 16S rRNA gene or the whole genome, many other techniques are
also used for microbial-based body fluid detection. Microarray-based detection of the
16S rRNA, GroEL, and 18S rRNA genes of several microorganisms present in feces and other
forensically relevant fluids has opened the way for applying this technique to the
detection of human body fluids (Quaak et al., 2017). RT-PCR has also been found useful
in the detection of body-fluid-specific microorganisms. RT-PCR-based detection of
three oral bacteria with 10°—10’ copies/reaction was found useful in the detection of hu- 
man saliva (Jung et al., 2018). With the advent of next-generation sequencing (NGS) 
technology, microbial-based body fluids detected are also deemed real using this most 
advanced technology (Yang et al., 2014). 

Beyond body fluid detection, applications of microbial markers in tracing geolocation,
personal identification, determination of biological sex, manner and cause of death, and
postmortem interval have also shown great promise in the field of forensic microbiology
(Robinson et al., 2021). However, many challenges remain in adopting microbial-based
body fluid detection. Inter- and intra-individual variation in microflora suggests
that microbial communities are strongly influenced by many factors, such as geography,
time, season, health, diet, genetic factors, anti-inflammatory drugs, and the lifestyle of
an individual. Hence, such parameters need to be evaluated before a sample is considered
for analysis and detection. Another important drawback of the microbial-based approach is
the omnipresent nature of microbes. Rapid contamination of a biological sample with
microorganisms complicates the analysis by making it difficult to distinguish between
endogenous and exogenous microflora. Thus, these aspects need to be studied thoroughly
before a microbial-based technique for human body fluid detection is developed. The
detailed pros and cons of available techniques with potential use in body fluid
identification have been listed in Table 22.2.


Table 22.2 A comparative account of the techniques used for the detection of body-fluids with their respective pros and cons. 


Sl. 


no. 


Techniques Body fluid/targets 


Alternate light source 
secretions, tears, 
urine and saliva 


Blood, semen, saliva, 
urine 


Catalytic tests 


Peripheral blood 
semen, menstrual 
blood, vaginal 
stains, saliva 


Microscopy 


Blood, semen, vaginal 


Advantages 


Nondestructive 
Efficient 
Simple 
Noninvasive 
Allows visualization of 
nonvisible stains or stains 
on dark surfaces 


Simple 
Fast results 
Highly sensitive 
Few tests are 
nondestructive in nature 
Inexpensive 
Easy to use 
Reliable 
User friendly 


Limitations 


Limited to use by skilled 
personnel possessing 
the knowledge of 
wavelength selectors 
Low specificity 
Poor results when 
applied on dark fabrics 
that have been washed 
Reported to give false- 
positive result 
Cannot predict human 
specificity 

False-positive 
Chemicals like 
benzidine are known 
carcinogens 
Generally presumptive 
test in nature 

For semen gives false- 
negative results in 
natural azoospermic 
individuals and 
vasectomized people 
Can be invasive 
Sample can’t be reused 
for other tests 
Time-consuming 
Using SEM coupled 
with EDX for 
preliminary screening 
of body fluid can be a 
huge burden of 
resources on the 
laboratory 


References 


Aparna and Iyer 


2021 
et al. 


2006 


An et al. 


An et al. 


, Miranda 
2014), 


Santucci et al. 
1999). 
Vandenberg and 
van Oorschot, 


(2012) 


(2012), 


Santucci et al. 
(1999), Scimeca 
et al. (2018) 
Table 22.2 A comparative account of the techniques used for the detection of body fluids with their respective pros and cons (cont'd).

Techniques | Body fluid/targets | Advantages | Limitations | References
Protein markers | Human peripheral blood, human menstrual blood, semen, vaginal secretions, urine, saliva, sweat | Human specific; Diversity of markers; Stability over time; Tissue-specific posttranslational modifications; Can be useful in detecting body fluid in aged forensic samples with high specificity and sensitivity; Mixed samples can be analyzed | Cross-reactivity; In a mixture of body fluids, detection of the alpha-amylase 1 biomarker in saliva is interfered with by the presence of both blood and semen | de Beijer et al. (2018), Igoh et al. (2015), Lee et al. (2011), Legg et al. (2014), Sakurada et al. (2008)
mRNA markers | Peripheral blood, menstrual blood, semen, saliva, vaginal secretions | High specificity; High sensitivity; mRNA markers are stable in nature (remain in dried samples even if samples are aged); Exhibits successful and reliable amplification in much older stains (16-year-old blood stains); Possibility of simultaneous extraction of mRNA and DNA from the same stain, as sample is limited; Allows detection of several body fluids in one multiplex reaction | Heat and humidity appear to be detrimental to RNA stability; False-positive results in saliva sample; RNA-based assays still suffer from the fact that RNA is less stable than DNA because of the ubiquitously present RNases | Cossu et al. (2009b), Hanson and Ballantyne (2013), Juusola and Ballantyne (2005), Nussbaumer et al. (2006), Virkler and Lednev (2009), Zubakov et al. (2008)
DNA methylation markers | Blood (venous), saliva, semen, vaginal fluid, urine, menstrual blood | High sensitivity and specificity; Requires minute amount of sample; Highly applicable and robust assay; Analysis of multiple tissues in a single test; Numerical, user-independent results; DNA-based assay does not necessarily consume physical material (if multiplexed) because it targets extracted DNA; Methylation loci and STR typing can be analyzed in a single reaction using commercially available STR profiling kits | Requires skilled expertise; Some methylated sequences might not have restriction enzyme sites, and methylated sequences that are not present in restriction sites will not be detected; Natural variability of methylation levels and stochastic PCR effects need to be considered; Genomic DNA can be degraded during bisulfite treatment, and a bisulfite-based methylation SNaPshot assay may consume more sample than MSRE-PCR | Frumkin et al. (2011), Gomaa et al. (2017), Lee et al. (2016)
Spectroscopic techniques | Peripheral blood, vaginal fluid, semen, saliva, urine, sweat, menstrual blood | Accurate; Sensitive technique; Specific; Nondestructive; Highly reliable; Rapid; Objective; Eco-friendly; No or minimal sample preparation required | If the sample is contaminated it cannot be analyzed; Water can create interference; Liquid samples can't be analyzed due to hydroxyl peaks in the amide regions; Many conditions can interfere with the spectral results, such as water submersion, sunlight exposure, heating, and rust | Sharma et al. (2020)
microRNA markers | Nasal secretions, venous blood, saliva, semen, vaginal secretion, urine, menstrual blood | Sensitive and specific approach; Stability in samples over long periods; Intrinsically small size of miRNAs makes them less prone to degradation by environmental factors | Difficult to discriminate fluids with similar traits such as saliva and vaginal secretion | Fujimoto et al. (2019), Rocchi et al. (2020)
Circular RNAs | Human seminal fluid, human blood, saliva, vaginal cellular material, menstrual blood | These molecules are closed circular structures, so they are very stable; Potential to identify fluids in degraded samples | Vaginal cellular material and menstrual blood could not be distinguished | Liu et al. (2019), Memczak et al. (2015), Sijen and Harbison (2021)
Microbial technique | Vaginal fluid, peripheral blood, saliva, semen, menstrual blood, sweat, urine, feces | Abundance of microflora; Stability; Signature microflora; Commensals in different tissue types | Vaginal and menstrual blood share common microbial signatures; Per sample analysis cost is high; Variation of inter- and intraindividual microflora; Omnipresent nature of microbes | Castillo et al. (2019), Giampaoli et al. (2014), Neugent et al. (2020), Peterson et al. (2016), Yao et al. (2020)
Rapid techniques (RSID kit, Nanotrap Sg, SALIgAE, etc.) | Human saliva, primate blood, semen, menstrual blood | Fast and efficient; The on-site test can be done; Performs well on aged and degraded samples | Cross-reactivity; Low sensitivity | Laffan et al. (2011), Old et al. (2012), Rahman et al. (2020), Sato et al. (2007), Spalding et al. (2003)
Quantum dot molecular beacon-based techniques | Blood, semen, saliva | Rapid technique; Confirmatory test; Less time consuming; Human-specific; Requires less training as compared to PCR-based analysis | Difficulty in multiplexing | Young et al. (2017)
Immunological tests | Human blood, semen, saliva, menstrual blood | Highly sensitive; Convenient and rapid; Highly specific and easy to use; Results are easy to interpret; The sample material remains suitable for DNA extraction and profiling | False-positive results are common; Nonspecificity | An et al. (2012), Holtkotter et al. (2018)
ELISA method | Saliva, vaginal fluid, urine, sweat | Low cost; High sensitivity and specificity | Sensitivity and specificity are dependent on the antigen used | Legg et al. (2014), Quarino et al. (2005), Virkler & Lednev (2009)
Conclusions 


Body-fluid detection is one of the most important aspects of forensic analysis. Though
the primary goal of analyzing any biological sample is individualization, the body-fluid
origin of the same sample should also be established. Hence, the forensic community
needs a technique that can serve for body fluid detection and individualization
simultaneously. Moreover, forensic samples are always precious, so nondestructive tests
for body fluid detection are preferred. Biological samples, including body fluids, are
mostly found at the crime scene. Before collecting these evidentiary materials, the
investigating agency prefers to confirm the nature of the body fluids or stains at the
crime scene itself. Hence, the forensic science community also requires portable
techniques for body fluid detection at the crime scene. The available techniques for body
fluid detection are either not confirmatory or not specific to human origin. The most
suitable technique should therefore be free of cross-reactivity, human-specific, portable
to the crime scene, and confirmatory. With advances in technology, many methodologies
have been developed over the years to detect human-specific body fluids. However, all the
techniques described here have their respective pros and cons. Though molecular methods
of body fluid detection are still in development, they have shown great promise for
crime scene and laboratory-based investigations of body fluids.


List of abbreviations 


ALS Alternate light sources. 

ATR-FTIR Attenuated total reflection Fourier transform infrared spectroscopy. 
BPPs Bipyrimidine photoproducts. 

circRNAs Circular RNAs. 

CNV Copy number variation. 

DMAC p-dimethylaminocinnamaldehyde. 

DNA Deoxyribonucleic acid. 

ELISA Enzyme-linked immunosorbent assay. 
ESI-LC-MS Electrospray ionization liquid chromatography-mass spectrometry. 
GAPDH Glyceraldehyde-3-phosphate dehydrogenase. 
HPLC High-performance liquid chromatography. 
LDH Lactate dehydrogenase. 

mfDNA Microflora DNA. 

NGS Next-generation sequencing. 

PAS Periodic acid-Schiff. 

PRM1 Protamine-1. 

RT-PCR Reverse transcription PCR. 

SAP Seminal acid phosphatase. 

SEM Scanning Electron Microscopy. 

SEMG1 Semenogelin-1. 

THP Tamm-Horsfall glycoprotein. 

VAP Vaginal acid phosphatase. 


NGS-based detection 
Introduction 


Investigative Genetic Genealogy (IGG) combines genetic analysis with traditional criminal
investigation. It is an emerging science that examines single nucleotide polymorphisms
(SNPs) in the human genome. The terms next-generation sequencing (NGS) and massively
parallel sequencing (MPS) describe the newest technologies that sequence DNA and RNA
more rapidly and cost-effectively (Børsting and Morling, 2015). These sequencing methods
have enabled the use of different types of genetic markers, such as SNPs, in forensic
genetic analysis. The use of SNPs as additional genetic markers has led to larger genetic
datasets and has grown alongside more powerful computational methods and automation, as
well as the compilation of additional and larger DNA databases. Decades of these
advancements in DNA science and technology have made IGG one of the most exciting
methods for identifying relatives of criminal suspects to crack cold cases. IGG is mainly
applied for this purpose; however, it is also used for other investigative questions. A
Google search for “IGG” yielded nearly 8,320,000 hits (January 16, 2023), including
research studies, articles, news stories, and videos, demonstrating its widespread
popularity. Genetic genealogy has also been used in disaster victim identification and
missing person identification. Many of these cases require large-scale genetic profile
comparisons (Chen et al., 2019).

The IGG process searches genetic data held by privately owned direct-to-consumer (DTC)
companies, which store millions of customer DNA profiles, using dense SNP data consisting
of half a million markers to investigate distant relationships (Machado and Silva, 2022).
The degree of relatedness is a parameter that investigators can set to narrow or widen
the search. DTC genetic databases empower IGG, but not all DTC companies allow law
enforcement access, owing to customer privacy and other commercial concerns. This makes
some of them unavailable for use in IGG. At present, an imbalance exists between the
growing scientific usefulness of IGG in forensic casework by the police and prosecution
and the actual number of times it is used.
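
To illustrate how a relatedness setting can narrow or widen a search, the sketch below uses the standard expectation that full n-th cousins share, on average, (1/2)^(2n+1) of their autosomal DNA (12.5% for first cousins, about 0.78% for third cousins). The function names, the match tuples, and the use of the expected value as a hard cutoff are assumptions made for illustration; real sharing varies around the expectation, and actual databases report matches in their own formats.

```python
def expected_shared_fraction(cousin_degree):
    """Expected autosomal DNA fraction shared by full n-th cousins: (1/2) ** (2n + 1)."""
    if cousin_degree < 1:
        raise ValueError("cousin_degree must be 1 or greater")
    return 0.5 ** (2 * cousin_degree + 1)


def filter_matches(matches, max_cousin_degree):
    """Keep matches sharing at least the expected fraction for the chosen degree.

    `matches` is a hypothetical list of (profile_id, shared_fraction) tuples.
    A larger `max_cousin_degree` lowers the threshold and widens the search.
    """
    threshold = expected_shared_fraction(max_cousin_degree)
    return [match for match in matches if match[1] >= threshold]


if __name__ == "__main__":
    # Made-up example values, not taken from any case.
    candidates = [("P1", 0.12), ("P2", 0.03), ("P3", 0.008), ("P4", 0.001)]
    print(filter_matches(candidates, max_cousin_degree=2))  # narrower search
    print(filter_matches(candidates, max_cousin_degree=3))  # wider search
```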

Nonetheless, IGG with NGS has become a powerful tool in forensic casework for solving
cold cases, especially in recent years and after its use in solving high-profile cases.
The well-known case of the ‘Golden State Killer’ was solved by an ancestry search in a
DTC database owned by GEDMatch, after the unidentified killer had committed a number of
murders and more than 50 sexual assaults (SA) over two decades in the 1970s and 1980s
(Wickenheiser, 2019). His DNA profile was found on different items of evidence collected
from the crime scene, but investigators could not identify him because his DNA was not
in any law enforcement database. The GEDMatch database, originally launched in 2010, was
used in 2018 by a California police officer, who ran a genealogical search on the
culprit’s DNA profile from crime scene evidence. The search identified a match with a
third cousin of Joseph James DeAngelo, a breakthrough after 40 years. By tracing through
the family tree, investigators narrowed the search down to DeAngelo as a suspect. A
reference sample was taken from him, and the DNA profile generated from it matched the
evidence sample profiles obtained in past cases. The investigation led to his
apprehension, and the case led to expanded use of various ancestry-based commercial
databases worldwide. Aside from the DeAngelo case, this study collected available data
from 65 public news report articles on 80 additional cold cases recently reported to have
been solved through the use of IGG and NGS [5-69]. The goal was to summarize all reported
cold cases solved using NGS and IGG. The study highlights the importance of IGG’s current
use in solving cold cases and its potential for solving even greater numbers of them, and
for solving more cases before they turn cold.


Methods 


To obtain an overview of all reported cold cases solved with NGS or IGG, a list of such
cases was compiled. Multiple internet search engines were used to search for the keywords
“cold cases solved with...” followed by the words “NGS”, “IGG”, “investigative genetic
genealogy”, “forensic genealogy”, and “DNA”, all in separate searches. “DNA solve” and
related phrases were also searched. The primary sources found were news reports and news
articles, with a few other sources such as science and law journals. This revealed 80
additional solved cases, which were then reviewed for further data of interest, compiled
from the news articles and reports. Each news article was used to collect data on both
the perpetrators and victims. Data collected on the perpetrators include their name,
reported gender, race, age, the type of crime or crimes they committed, when and where
the crimes were committed, the year they were arrested and brought to justice, and crimes
solved (Table 23.1). Corresponding data were also collected about the victims (Table 23.2).
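
The keyword strategy described above can be written down as a short sketch that generates one query string per separate search. The prefix and terms are those listed in the text; the function name is hypothetical, and the sketch only builds the strings rather than calling any search engine.

```python
# Prefix and terms as listed in the Methods text.
PREFIX = "cold cases solved with"
TERMS = [
    "NGS",
    "IGG",
    "investigative genetic genealogy",
    "forensic genealogy",
    "DNA",
]
# Additional phrase searched on its own.
EXTRA_PHRASES = ["DNA solve"]


def build_queries():
    """Return one query string per separate search (hypothetical helper)."""
    return [f"{PREFIX} {term}" for term in TERMS] + list(EXTRA_PHRASES)


if __name__ == "__main__":
    for query in build_queries():
        print(query)
```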


Table 23.1 Information on perpetrators of cold cases solved with investigative genetic genealogy. 


Name Gender Race State* Year Crime type Date solved 
1 Steve Branch M G AK H, SA 2020 
2 Lesa Lopez F C CA Filicide 2020 
3 Alan Edward Dean M C WA Kidnapping, H 2020 
4 Michael Allan Carbo Jr M C MN H, SA 2020 
5 William Baer M C FL H 2020 
6 Robert Lynn Bradley M C FL H 2020 
7 Julius William Hill Jr M C FL H 2020 
8 Daniel Nyqvist M C Sweden H 2020 
9 Daniel Alan Anderson M AA DE H 2020 
10 Phillip Wilson M AA CA H, SA 2020 
11 Terry Rasmussen M C NH H 2020 
12 James Curtis Clanton M C CO H, SA 2020 
13 Robert Dale Edwards M C CA H 2020 
14 Bruce Lindahl M C IL H 2020 
15 Unnamed M UN NC H 2018 
16 Charles Gary Sullivan M C AZ H, SA 2019 
17.1 Horace Van Vaultz Jr. M AA CA H, SA 2019 
17.2 Horace Van Vaultz Jr. M AA CA H, SA 2019 
18 Linda LaRoche F C WI H 2019 
19 Brook Graham F C SC H 2019 
20 John Arthur Getreu M C CA H 2019 
21 Frank Wypych M C WA H, SA 2019 
22 Jeffrey Lynn Hand M UN IND H, SA 2019 
23 Richard Knapp M C Vancouver H 2019 
24 Terrence Miller M C WA 2019 
25.1 Cecil Stan Caldwell M C MT H 2019 
25.2 Cecil Stan Caldwell M C MT H 2019 
26.1 Coley McCraney M AA AL SA, H 2019 
26.2 Coley McCraney M AA AL SA, H 2019 
Continued 
MD 
FL 


Year 


1976 
1976 
1994 
1989 
1984 
1977 
1979 
1973 
1993 
1986 
2006 
1990 
1992 
1988 
1973 
2007 
2001 
2010 
1997 
1990 
1990 
1990 
1999 
2009 
2016 
2016 
1981 
1986 
1987 
1987 


Crime type 


H 

SA, H 
SA, H 
SA 
Date solved 


2019 
2019 
2019 
2019 
2019 
2019 
2019 
2019 
2019 
2019 
2019 
2019 
2019 
2018 
2018 
2018 
2018 
2019 
2018 
2018 
2018 
2018 
2020 
2018 
2018 
2018 
2018 
2018 
2018 
2018 
53 
54 
55 
56 
57 
58 
59 
60 
61.1 
61.2 
61.3 
61.4 
62 
63 


Gary Young 

Leroy Jemol Smith 

Mark Dauglas Burns 
Gregory Paul Vein 

Gilles Warrick 

Johnnie B.Greens 

Roy Charles Waller 
William Louis Nicholas 
Marlon Michael Alexander 
Marlon Michael Alexander 
Marlon Michael Alexander 
Marlon Michael Alexander 
Darold Wayne Bowden 
Spencer Glen Monnett 


See SSS 555505858 8 


OOF ESE OCR OCR 


AZ 
OK 
UT 
CA 
SC 
NC 
NC 
FL 
MD 
MD 
MD 
MD 
NC 
UT 


AA, African American; C, Caucasian; H, homicide; SA, sexual assault; UNK, Unknown. 


*For cases outside of the USA, city or country is reported. 


1981 
2005 
1992 
1997 
1990 
2009 
1991—2006 
1983 
2010 
2010 
2011 
2007 
2006 
2018 


SA 
SA 
SA 
SA 
SA, H 
SA 
SA 
SA 
SA 
SA 
SA 
SA 
SA 
SA 
2020 
2020 
2019 
2019 
2019 
2019 
2019 
2019 
2019 
2019 
2019 
2019 
Table 23.2 Information on victims of cold cases solved with investigative genetic genealogy. 
Name Gender 


Jessica Baggen 

Baby John Doe 
Melissa Lee 

Nancy Daugherty 
Saad Kawaf 

Wanda Deann Kirkum 
Carolyn Cox Rose 
Anna—Lena Svensson 
John Muncy 

Robin Brooks 
Marlyse Honeychurch 
Helene Pruszynski 
Naomi Sanders 
Pamela Maurer 
Reesa Trexler 

Julia Woodward 
Mary Duggan 

Selena Keough 
Peggy Lynn Johnson 
Baby Boy 

Janet Taylor 

Susan Galvin 

Pam Milam 

Audrey Hoellein 
Jody Loomis 

Clifford Bernhardt 
Linda Bernhardt 
Tracie Hawlette 
J.B.Beasley 

David Schuldes 


Race 


NK 


NK 
State* 


AK 
CA 
WA 
MN 
FL 
FL 
FL 
Sweden 
DE 
CA 
NH 
CO 


CA 
Montclair 
IL 
SC 
CA 
WA 
IN 
OR 
WA 
MT 
MT 


AL 
AL 
WI 


Age 


17 

0 

15 

38 
est.45 
18 

47 

56 

15 
20 

24 

21 

57 

16 

15 

21 

22 

20 

18 
Newborn 
21 

20 
18—22 
26 

20 

20 

20 

17 

17 
25 


Year 


1996 
1988 
1993 
1999 
1991 
1978 
2004 
1983 
1980 
1978 
1980 
1973 
1976 
1984 
1979 
1986 
1981 
1999 
1989 
1974 
1967 
1972 
1994 
1972 
1973 
1973 
1999 
1999 
1976 


Type of assault 
H, SA 
Date solved 


2020 
2020 
2020 
2020 
2020 
2020 
2020 
2020 
2020 
2020 
2019 
2019 
2019 
2019 
2019 
2019 
2019 
2019 
2019 
2019 
2019 
31 Ellen Matheys F C WI 24 1976 SA, H 2019 
32 Le Bich-Thuy F A MD 42 1994 SA, H 2019 
33 Unknown F UNK MD 52 1989 SA 2019 
34 Pamela Cahanes F C FL 25 1984 H 2019 
35 Bynn Riney F C CO 27 1977 SA, H 2019 
36 Carol Anderson F C CO 16 1979 SA, H 2019 
37 Linda O’Keefe F C CA 11 1973 SA, H 2019 
38 Sophie Sergie F A AK 20 1993 SA, H 2019 
39 Anna Marie Hlavka F C OR 18 1986 SA, H 2019 
40 Scott Martinez M H OR 47 2006 H 2019 
41 Jack Upton M C AZ 30 1990 H 2019 
42 Christy Mirack F C PA 25 1992 SA, H 2019 
43 April Tinsley F C IN 8 1988 SA, H 2018 
44 Leslie Perlov F C CA 21 1973 H 2018 
45 Jodine Serrin F C CA 39 2007 SA, H 2018 
46 Christine Franke F C FL 25 2001 H 2018 
47 Michael A. Temple M C MD 24 2010 H 2018 
48 Ann Smith F C GA 28 1997 H 2018 
49 Pamela Faye Felkins F C KS 32 1990 SA, H 2018 
50 Betty Jones F C MS 65 1990 SA, H 2018 
51 Jenny Zitricki F C SC 28 1990 SA, H 2018 
52 Deborah Dalzell F C FL 47 1999 SA, H 2020 
53 Holly Cassano F C IL 22 2009 H 2018 
54 Constance Gauthier F C MA 81 2016 H 2018 
55 Virginia Feeman F C TX 40 1981 SA, H 2019 
56 Michella Welch F C WA 12 1986 SA, H 2018 
57 Tanya Van Cuyleborg F C WA 18 1987 H 2018 
58 Jay Cook M C WA 20 1987 H 2018 
59 Unknown F UNK AZ NA 1991 SA 2020 
60 Unknown F UNK OK NA 1993-95 SA 2020 
61 Unknown F UNK CA NA 1997 SA 2019 
Table 23.2 Information on victims of cold cases solved with investigative genetic genealogy (continued). 


Name Gender Race State* Age Year Type of assault Date solved 
62 Christine Mirzayan F C 1998 2019 
63 Unknown F UNK 2009 2019 
64 Unknown F UNK 2006 2019 
65 Unknown F UNK 1983 2019 
66 Unknown F UNK 2007 2019 
67 Unknown F UNK 2010 2019 
68 Unknown F UNK 2010 2019 
69 Unknown F UNK 2011 2019 
70 Unknown F UNK 2007 2019 
71-76 Unknown x6 F UNK 2006-08 2019 
77 Unknown F UNK 2018 2018 
78 Lisa Roberts F C 1977 2020 
79 Paul Fronczak M C 1964 Kidnapping 2019 
80 Horace Jack M UNK 1976 Unknown 2019 


H, homicide; SA, sexual assault; AA, African American; C, Caucasian; UNK, Unknown. 
*For cases outside of the USA, city or country is reported. 


Multiple crimes committed by the same individual were denoted with the same number 
with a dash for each serial numerical count. Because many of the perpetrators committed 
multiple crimes on multiple victims, the number of perpetrators is less than the number of 
victims listed, with 63 perpetrators caught and 80 victims’ cases solved. 

A given news article’s publication date was used as an estimate of the date on which 
the crimes were solved, which is one potential limitation of the accuracy of that value in 
this research. Likewise, all information and any results interpreted from the sources in this 
study rely on and are limited by the accuracy of the news articles reporting on the solved 
cold cases. The information was analyzed using pie charts and bar graphs to perform com- 
parisons based on the categories of data collected. Comparisons are shown for the per- 
centage of crimes committed by the reported gender of the perpetrator (Fig. 23.1), 
race of the perpetrator (Fig. 23.2), gender of victims (Fig. 23.3), geographical location 
(Fig. 23.4), and length of time in years that the crime went unsolved (Fig. 23.5). 
Repeated offenses by the same individual were counted separately in statistics totaling 
the number of crimes. Not all of the victims and perpetrators were accounted for in 
each of the statistics because, in some cases, the name, gender, or other details of the 
crimes remain unknown. For determining any percentages or other comparisons in any 
cases with these unknowns, the unknowns were omitted from the count and not 
included in the calculation. Crimes by category were tabulated in Fig. 23.6. 
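
The tabulation described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used in the study: the record layout (race, year of crime, year solved), the helper name `category_percentages`, and the set of placeholder values treated as unknown are all assumptions, and rows 31-40 of Table 23.2 serve as sample input.

```python
from collections import Counter
from statistics import mean, stdev

# Placeholder values treated as "unknown" (assumed from the table's conventions).
UNKNOWN = {"UNK", "NA", None}


def category_percentages(values):
    """Return {category: (count, percent)}, omitting unknown values."""
    known = [v for v in values if v not in UNKNOWN]
    counts = Counter(known)
    return {cat: (n, round(100 * n / len(known), 1)) for cat, n in counts.items()}


# (race, year of crime, year solved) for rows 31-40 of Table 23.2.
rows = [
    ("C", 1976, 2019), ("A", 1994, 2019), ("UNK", 1989, 2019),
    ("C", 1984, 2019), ("C", 1977, 2019), ("C", 1979, 2019),
    ("C", 1973, 2019), ("A", 1993, 2019), ("C", 1986, 2019),
    ("H", 2006, 2019),
]

# Percentages are computed over known values only (the UNK row is skipped).
race_pct = category_percentages(race for race, _, _ in rows)

# Time unsolved is approximated as year solved minus year of the crime.
years_unsolved = [solved - year for _, year, solved in rows]

print(race_pct)
print(round(mean(years_unsolved), 1), round(stdev(years_unsolved), 1))
```

Because unknowns are dropped before the denominator is computed, the percentages for a given variable always sum to roughly 100% of the known cases, which is why totals can differ between figures.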


Results 


From the gathered news information, the oldest cold case solved was the kidnapping of 
Paul Fronczak, which went unsolved for over 55 years. The data are shown in Figs. 
23.1-23.6. Other than the single kidnapping case, the cold cases were all sexual assault 
(SA) and homicide (H) cases. In 95% (n = 59) of cases the crimes were committed 
by males, specifically Caucasian males 75% (n = 55), African American males 22% (n = 
16), and 3% of unknown ancestry (n = 2) (Table 23.1 and Figs. 23.1 and 23.2). Out of 
the total number of assailants (n = 63), five white males and two African American males 
were repeat offenders involved in sexual assault cases, homicides, or both. The remaining 
5% (n = 2) of cold cases were committed by Caucasian female assailants, which were 
strictly linked to homicides. The geographic locations of the crimes are shown in Fig. 
23.4. The higher case distribution in certain states, such as Maryland, was affected by 
the serial offenses committed by a single individual. More cold cases were solved overall 
in large states including California and Florida. Information on the victims (n = 80, ages ranging from 
0 to 86, with an average age of 29.4 and standard deviation of 18.61 years) (Table 
23.2 and Fig. 23.3) was gathered using only available news information. Females 
comprised 66 victims and accounted for 85% of the victims, with 95% of the victims being 
Caucasian females, while there were also two Asian females and one Hispanic female 
victim. In some cases, variables of interest were unavailable and, as such, were not 
included in the analysis. For example, due to privacy concerns, the surviving victims’ 
names, ages, and races were often undisclosed. However, the living victims’ genders 
were counted and applied in the comparisons, as was the category of offense. Fig. 
23.5 shows the time it took to solve a case. It took 29.1 years on average to link a perpetrator 
to a crime, with a standard deviation of 13.36 years. It is noted that all cold cases in 
this study were solved in the years 2018, 2019, and 2020, and the study does not account for cases that 
might have been solved more recently. The crime categories were subdivided into homicides 
(H) (n = 30, 40%), sexual assault (SA) (n = 13, 17%), homicide with sexual assault 
(H with SA) (n = 31, 41%), kidnapping (n = 1, 1.1%), and unknown categories 
(n = 1, 1.1%) (Fig. 23.6). 

Figure 23.1 Gender comparisons by number and percentage for perpetrators of crimes reported 
solved. (Chart title: Crimes Committed by Gender.) 

Figure 23.2 Comparison of perpetrators’ race by number of crimes reported solved. (Legend: C, AA, Unknown.) 

Figure 23.3 Gender of victims by number count and percentage. (Chart title: Victims by Reported Gender; legend: Males, Females.) 

Figure 23.4 Geographical location of crimes by US state/province and country. (Chart title: Crimes by Geographical Location.) 

Figure 23.5 Comparison of lengths of time crimes remained unsolved. 


Figure 23.6 Comparison of type of crimes by number and percentages. (Chart title: Crimes by Category; legend: SA; H; SA, H; Kidnapping; Unknown.) 


Discussion and conclusion 


Investigative genetic genealogy has become an important tool for unlocking cold cases. It 
is estimated that GEDmatch data contains over 1.2 million individual profiles, with a 90% 
chance to identify at least one-third of the cousins of anyone in the population (Khan and 
Mittelman, 2018). Each cold case described in the articles was based on DNA profiles that 
were closely matched to the crime scene or evidence through genetic association. The 
generated data demonstrate that crimes solved by these methods are heavily skewed toward 
white male perpetrators and young white female victims. The majority of crimes 
were sexually motivated homicides, which partly explains why the majority of 
victims were female. The predominance of white males among the perpetrators 
in these solved crimes is possibly explained by the fact that the customers of DTC databases 
are primarily Caucasian, so most of the genetic data received and held 
by DTC companies come from Caucasians. This logically leads to catching more perpetrators who are 
family members of Caucasians. Due to a lack of sufficient evidence, it remained unclear 
in several repeat offense cases whether all their victims were sexually assaulted before be- 
ing murdered. Serial killing offenses were also committed mainly by white males. One 
black male sexually assaulted six women in Maryland but did not murder them. The 
geographic distribution of crimes was disproportionately affected by the serial offense 
cases, as opposed to being based on population, socioeconomic, or other factors related 
to location. All listed cases were solved in the years 2018, 2019, and 2020, with the oldest 
case remaining unsolved for 55 years and the newest solved in the same year as the 
offense. The number of old cases remaining unsolved has decreased significantly in recent 
years. This is likely due, at least in part, to the growing scientific and technological 
ability to solve crimes with these databases, enhanced by the ongoing growth 
and developments in NGS, including the ability to use an ever-growing number of 
different types of genetic markers such as SNPs and InDels. The trend is also likely 
due in part to the purchase of the genetic database GEDmatch by Verogen 
in 2019, which then dedicated the platform to forensic investigatory uses (Erlick et al., 
2018). Recently, Verogen has ensured that anyone contributing to GEDmatch is explic- 
itly warned that criminal investigators or other people just interested in genealogies will 
be able to perform DNA comparisons against their data (Greytak et al., 2019). The tool 
now opens the database to law enforcement if the customers receive this warning and still 
choose to use the service and contribute their genetic data for this purpose. Additionally, 
privacy concerns are reduced by the fact that no one is legally required to contribute to a 
genetic genealogy database. Contribution is purely voluntary, and none of the DNA data are 
in the possession of any government agency. Most of the success in catching criminals 
with IGG on DTC databases has been in cold cases. However, genetic genealogy 
has just as much ability to generate leads in active cases. In 2018, it was used to identify 
and arrest a perpetrator in an active sexual assault case that occurred only 3 months prior 
to the arrest. The individual arrested pled guilty in court in 2019. In January 2023, it was 
utilized to lead to the arrest in Pennsylvania of a suspect in the murder of four Idaho col- 
lege students (Sun et al., 2022). While this suspect has not yet been brought to court and 
found guilty, it shows that investigators can now have access to modern forensic DNA 
technologies that can quickly generate significant new leads rather than waiting until 
all other means of investigation are exhausted. IGG with NGS can prevent cases from 
ever going cold as well as continue to solve increasing numbers of unsolved cold cases. 
To help with this, opening DTC companies with large databases to police access, despite privacy 
and other concerns, will aid in solving more cases. If police and law enforcement, along 
with academies, universities, and private DTC companies, work to develop their relationships 
with each other and to form standard procedures and protocols, this will facilitate 
more familiar interactions with one another and give law enforcement greater and 
quicker access for searches. Police searches of these DNA databases have strongly shown 
their potential to solve some of the most serious crimes quickly. Additional consumer 
contributions to the databases from all races and ethnic groups may facilitate cases 
being solved for more victims. 
Forensic genetics: timeline of technology evolution 


DNA sequencing was first described by Maxam and Gilbert and by Sanger et al. in 1977 (Maxam & 
Gilbert, 1977; Sanger et al., 1977). Subsequently, this methodology of analysis evolved into 
Next-Generation Sequencing (NGS) techniques, which are also known by the term 
Massively Parallel Sequencing (MPS). An MPS technique is defined by the National Cancer Institute’s 
Dictionary of Genetic Terms as: “A high-throughput method used to determine a 
portion of the nucleotide sequence of an individual’s genome.” This technique uses DNA 
sequencing technologies that are capable of processing multiple DNA sequences in parallel. 

DNA sequencing techniques have undergone significant developments over 
the years, which allow these analysis methodologies to be subdivided into first, second, 
and third generation sequencing. A variety of important developments in first, second, 
and third generation sequencing are depicted in Fig. 24.1. 

As the timeline shows, the first-generation Sanger sequencing method, 
developed in 1977, dominated the market for a long time. In the years 2005-2007 
several second generation systems were launched in the market, such as the Solexa 
Genome Analyzer, from Illumina, and the SOLiD system from Applied Biosystems. 
The so-called third generation sequencers started with the launch of the Helicos Genetic 
Analysis Platform in 2007 (from Helicos BioSciences), which was the first DNA-sequencing 
instrument to operate by imaging individual DNA molecules. In 2014, Oxford Nanopore 
Technologies (ONT) introduced the MinION for early access; it is a 
pocket-sized device that applies nanopore sequencing technology to nucleic acid analysis. 

The scaled-up GridION was commercially launched in 2017 and PromethION in 
2018, with the largest device, the PromethION 48, first shipped in 2019 (Nanopore 
sequencing-Oxford Science Park, Oxford, UK). All of these technologies that have 
developed over the years since the Sanger method have increased efficiency and accuracy 
by more than three orders of magnitude (Bruijns et al., 2018). 

In principle, the concepts behind Sanger versus next-generation sequencing (NGS) 
technologies are similar. In both NGS and Sanger sequencing, DNA polymerase adds 
fluorescent nucleotides one by one onto a growing DNA template strand. Each incorporated 
nucleotide is identified by its fluorescent tag. The critical difference between 
Sanger sequencing and NGS is sequencing volume. While the Sanger method only sequences 
a single DNA fragment at a time, NGS is massively parallel, sequencing millions 
of fragments simultaneously per run. This process translates into sequencing hundreds to 
thousands of genes at one time. 

Figure 24.1 Timeline of the most important developments regarding NGS. (From Bruijns, B., Tiggelaar, 
R., Gardeniers, H. (2018). Massively parallel sequencing techniques for forensics: A review. Electrophoresis, 
39(21), 2642-2654. doi: 10.1002/elps.201800082. Epub 2018 Aug 22. PMID: 30101986; PMCID: 
PMC6282972.) 

With each step, more sophisticated bioinformatics tools, programs, and software for 
DNA sequencing have provided more automation and higher throughput. In recent 
years, several parallel sequencing (NGS) methods have become available, allowing 
large-scale production of genomic sequence, and the number of human genomes 
sequenced with such instrumentation has increased rapidly. 


First generation sequencing 


Sanger sequencing was developed in 1977 by Frederick Sanger, who was awarded the 
Nobel Prize in Chemistry in 1980 (Sanger et al., 1977). This method, which is now 
known as the first-generation sequencing, is rather similar to PCR, because an ssDNA 
template, a DNA primer, DNA polymerase, and deoxynucleotide triphosphates (dNTPs) 
are also required to perform the reaction. Furthermore, di-deoxynucleotide triphos- 
phates (ddNTPs) are needed, which can be incorporated into the newly synthesized 
DNA strand, just as normal dNTPs, but with termination of the elongation process as 
a result. Therefore, this method is also known as the chain termination method. The reaction 
is carried out in four separate tubes, each of which contains (besides the normal dNTPs) only 
one of the four labeled ddNTPs, added in a relatively low concentration. Nowadays the 
sequence can be determined using fluorescently labeled ddNTPs and capillary (gel) electrophoresis 
(ibidem). 
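
The chain termination principle can be illustrated with a small Python sketch. It is an idealized toy model, not a simulation of real chemistry: it assumes that exactly one terminated fragment is produced for every position of the new strand, ignores the primer, and uses a made-up template; all function names are hypothetical.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def synthesize(template):
    """Strand built by the polymerase: the reverse complement of the template."""
    return "".join(COMPLEMENT[base] for base in reversed(template))


def chain_terminated_fragments(new_strand, rng):
    """One fragment per position, each ending in a labeled ddNTP.

    The fragments are shuffled to mimic the unordered mixture in the tube.
    """
    fragments = [new_strand[: i + 1] for i in range(len(new_strand))]
    rng.shuffle(fragments)
    return fragments


def read_by_electrophoresis(fragments):
    """Order fragments by length and read the terminal (labeled) base of each."""
    return "".join(fragment[-1] for fragment in sorted(fragments, key=len))


template = "GATTACAGC"  # made-up template sequence
new_strand = synthesize(template)
fragments = chain_terminated_fragments(new_strand, random.Random(0))
read = read_by_electrophoresis(fragments)
print(read == new_strand)
```

Sorting by length plays the role of electrophoresis: the label on the last base of each successively longer fragment spells out the synthesized strand.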
Second generation sequencing 


The development of PCR in 1985 led to major improvements in instrumentation, 
because alternatives to the principle of Sanger sequencing became available. The new 
systems such as 454, SOLiD, and Ion Torrent make use of a cell-free system, whereas 
for Sanger sequencing bacterial cloning of DNA fragments was required (Kumar, 2012; 
Shendure & Ji, 2008; Van Dik et al., 2014). 

Compared to Sanger sequencing, the second generation NGS techniques allow us to 
obtain a huge amount of data (sequences) in less time and at lower cost. They make it possible 
to analyse multiple targets of the same sample at the same time (up to the entire genome of an organism), 
to analyse multiple samples at the same time for the same targets or even 
for different targets between samples, and to run millions of sequencing reactions in parallel, 
compared to the tens or hundreds possible with Sanger sequencing. They also do not require the bacterial cloning of DNA 
fragments, because library preparation is based on cell-free systems. 

There are different types of platforms (e.g., Roche, Thermo Fisher Scientific, Illumina), 
each with different instruments (e.g., MiSeq and HiSeq for Illumina or SOLiD and 
Ion Torrent for Thermo Fisher Scientific). Each has its own distinctive features, but 
they all share the commonality of using DNA or cDNA (RNA) libraries as input for 
sequencing. Two primary MPS platforms are used in forensic DNA analysis: (1) the MiSeq 
FGx Forensic Genomics System (Illumina, San Diego, CA, USA) and (2) the Ion Torrent 
PGM or Ion S5 (Thermo Fisher Scientific, Waltham, MA, USA) (Haddrill, 2021). 


Third generation sequencing 


Although second generation sequencing techniques are based on amplification, with the 
next-next (or third generation) methods single molecules are read in real time. 
Therefore, these techniques are much faster, and longer reads can be generated than 
with the previous generations of sequencing techniques. Single molecule sequencing 
(SMS) technologies can be grouped into three categories. The first category is sequencing-by-synthesis (SBS) 
methods, whereby single molecules of DNA polymerase are observed at the moment 
they are synthesizing a single DNA molecule (e.g., PacBio). Nanopore sequencing 
techniques are the second category (e.g., Oxford Nanopore), and the third category consists 
of methods that use direct imaging of individual DNA molecules by means of 
advanced microscopy (e.g., VisiGen) (Bruijns et al., 2018). 


Current limitations and problem solving in forensic genetics 


To date, the technology used for short tandem repeat (STR) marker typing is based on 
multiplex PCR followed by capillary electrophoresis and is well established in all forensic 
genetics laboratories. In contrast, MPS platforms require a different analytical workflow 
and new forensic bioinformatics resources that are currently not widely available due to 
the enormous amount of data generated. In fact, it must be considered that millions of 
reads per sample can be obtained from a single sequencing experiment conducted with 
MPS methodology. Traditional STR analysis by gel electrophoresis or capillary electrophoresis 
(CE) estimates the number of STR repeats by the size of the PCR amplicon. In 
contrast, analysis with NGS allows determination of the entire sequence of the PCR 
product, including the STR repeat region and surrounding flanking areas. Numerous 
studies have shown that sequence-based STR analysis, as opposed to amplicon size- 
based analysis, results in a pronounced increase in allelic variability for many forensically 
relevant STRs (Gettings et al., 2017). Even loci that appear to show no additional 
sequence variation in individual populations can be informative in others, as demon- 
strated by a study of four major population groups in the USA where only TPOX of 
the commonly used autosomal loci failed to show added sequence variation in any of 
the investigated populations (Phillips et al., 2018). Capturing the additional sequence 
variation that may be present in the STR repeat and flanking regions can have many ad- 
vantages beyond simply improving discrimination power when using these markers for 
direct matching or relationship calculations. For DNA mixtures, the improved discrim- 
ination of the markers will in itself be very advantageous for identifying extra sequence- 
specific alleles that would have been previously masked due to identical CE amplicon 
lengths, while the enhanced discrimination will also make it less likely that an individual 
will have an adventitious match between their profile and the alleles present within a 
mixed trace. 
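
The difference between length-based and sequence-based STR typing can be made concrete with a short Python sketch. The two alleles below are invented for illustration (they are not taken from any real locus), the repeat unit length is assumed to be four bases, and the function names are hypothetical.

```python
from collections import Counter

REPEAT_LEN = 4  # assumed tetranucleotide repeat unit for this toy locus


def length_based_alleles(reads):
    """CE-style call: allele designation from fragment length only."""
    return Counter(len(read) // REPEAT_LEN for read in reads)


def sequence_based_alleles(reads):
    """MPS-style call: every distinct repeat sequence is its own allele."""
    return Counter(reads)


# Two invented alleles with the same length (10 repeats) but different sequence.
allele_a = "TCTA" * 8 + "TCTG" * 2
allele_b = "TCTA" * 9 + "TCTG" * 1
reads = [allele_a] * 50 + [allele_b] * 48

by_length = length_based_alleles(reads)
by_sequence = sequence_based_alleles(reads)

print(by_length)          # one apparent allele: 10 repeats
print(len(by_sequence))   # two sequence-distinct alleles
```

By length the sample looks homozygous for allele 10, whereas by sequence it is heterozygous. In a mixture, the same effect can reveal an additional contributor whose allele would otherwise be masked by an identical CE amplicon length.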

One notable advantage of analysis by MPS rather than CE is that separation by ampli- 
con size is no longer a requirement for multiplex STR assay design; this means that all 
STRs can be amplified using the smallest feasible amplicon length, improving amplifica- 
tion in cases of degraded DNA, and that the number of markers amplified simultaneously 
is no longer constrained by the fluorescent dye detection capabilities of electrophoretic 
systems, hence heralding the possibility of coamplifying increasing numbers of STRs at 
the same time. One use of this would be to allow coamplification of large panels of Y 
and autosomal STRs, providing increased power for male/female mixed stain analysis, 
e.g., following sexual assault (Ballard et al., 2020). 

However, the amplification of forensic DNA samples for MPS is not limited to STR loci, 
and SNP markers can also be analysed either in combination with STRs or on their 
own. The advantages of analysing SNPs for forensic samples with MPS rather than alter- 
native technologies such as SNaPshot or SNP array systems, rest predominantly in the fact 
that it is possible to target large (compared with historical SNP systems used in forensic ge- 
netics) numbers of markers simultaneously from low quantities of DNA. It is also possible 
to target microhaplotypes with NGS, that is, sets of SNPs located in very close proximity to 
each other on a chromosome, and these microhaplotype markers have shown promise in 
forensics for both identification and ancestry purposes (Wendt et al., 2017). Moreover, the 
fact that the entire PCR amplicons are analysed when typing SNP markers with NGS 
means that any observed variation in the PCR flanking regions will also in effect turn 
the targeted SNP into a microhaplotype. One example of this is reported by Wendt et al. 
(2017): when studying a Native American population, they found that 22 of the 94 identity 
SNPs studied contained flanking region variants in the tested population, of which 14 
were informative (i.e., not in total linkage with one specific allele). The fact that some identity 
SNPs are actually microhaplotypes is to be expected, and the additional information 
that this extra variation adds will be valuable for all applications. 
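
How a flanking variant turns a single SNP into a microhaplotype can be sketched in Python as follows. The reads and positions are invented, and the sketch assumes the reads are already aligned to the same start position with no insertions or deletions; `microhaplotypes` is a hypothetical helper name.

```python
from collections import Counter


def microhaplotypes(reads, snp_positions):
    """Join the bases found at the given positions of each read into one haplotype."""
    return Counter("".join(read[p] for p in snp_positions) for read in reads)


# Invented amplicon reads: the targeted SNP is at index 5,
# and a flanking-region variant is at index 2.
reads = ["ACGTTAGC", "ACGTTAGC", "ACATTAGC", "ACGTTGGC"]

snp_only = microhaplotypes(reads, [5])       # targeted SNP on its own
with_flank = microhaplotypes(reads, [2, 5])  # SNP plus flanking variant

print(snp_only)
print(with_flank)
```

The targeted SNP alone shows two alleles, while the combination with the flanking variant shows three haplotypes, which is the extra discrimination referred to above.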

Another important factor to consider regarding NGS methodology is that currently 
a few NGS bench-top sequencers dominate the forensic genetics landscape: 
Illumina’s MiSeq FGx and Thermo Fisher’s Ion Torrent PGM and Ion S5. While Illumina implements 
cycle-based sequencing technology coupled with a reversible termination strategy 
of fluorescently labeled modified dNTPs, conceptually similar to Sanger sequencing, the 
Ion Torrent is a semiconductor sequencer that measures pH changes as a consequence of the 
release of hydrogen ions during synthesis of DNA. While base substitutions are the most 
common sequencing errors generated during sequencing on Illumina machines, inser- 
tions and deletions are the most frequent errors introduced by the Ion Torrent PGM. 
In the case of the latter, homopolymer stretches longer than 6 bp are the most difficult 
to call, since the correlation between the incorporated nucleotides and the change in 
detected voltage is not exactly to scale. Although the error rates generated by the Ion 
Torrent are higher (approximately 1%) and DNA library preparation protocols can be more time-consuming 
and cumbersome in comparison with the MiSeq workflow, the lack of optical 
scanning and cycle-based sequencing significantly reduces the time of DNA 
sequencing. One notable advantage of the Ion Torrent platforms is the availability of 
automated library preparation and chip loading stations, which simplifies the workflow 
considerably. Recently, similar automation and liquid handling solutions have been pro- 
posed for many MiSeq workflows (Laurent et al., 2017). 
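
As a simple illustration of the homopolymer issue, the following Python sketch flags stretches longer than 6 bp in a sequence, i.e., the regions described above as the most difficult to call on the Ion Torrent PGM. The example sequence and the function name are invented; this is not part of any vendor software.

```python
import re

# Matches any run of a single repeated base.
RUN_PATTERN = re.compile(r"A+|C+|G+|T+")


def long_homopolymers(seq, min_len=7):
    """Return (start, base, length) for every homopolymer run of at least min_len bases."""
    return [
        (m.start(), m.group()[0], len(m.group()))
        for m in RUN_PATTERN.finditer(seq)
        if len(m.group()) >= min_len
    ]


# Invented sequence: an 8 bp A run, a 6 bp T run (not flagged), and a 7 bp G run.
sequence = "ACGT" + "A" * 8 + "CG" + "T" * 6 + "G" * 7
print(long_homopolymers(sequence))
```

Reads overlapping the flagged regions could, for example, be reviewed more carefully for insertion and deletion errors before alleles are called.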

Among the existing commercial solutions, the ForenSeq DNA Signature Prep Kit (Illumina, 
CA) is the first commercially available STR kit for the MiSeq FGx; it allows 
amplification of up to 153 (DNA primer mix A) or 231 (DNA primer mix B) loci simultaneously. 
The multiplex assay includes 27 forensic autosomal STRs, 24 Y-STRs, 7 X-STRs, 
the Amelogenin sex marker, and either 94 or 172 SNPs depending on the multiplex 
formulation. The 172 SNPs can be divided into identity-informative, geographical 
ancestry-informative, and phenotype-informative SNPs (Ballard et al., 2020). 
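
The locus totals quoted for the two primer mixes follow directly from the marker counts given in the text, as this short Python check shows (the variable names are arbitrary).

```python
# Marker counts as stated in the text for the ForenSeq DNA Signature Prep Kit.
str_markers = {"autosomal STRs": 27, "Y-STRs": 24, "X-STRs": 7}
amelogenin = 1
snps_per_mix = {"A": 94, "B": 172}

# Total loci per primer mix = all STRs + Amelogenin + SNPs in that mix.
totals = {
    mix: sum(str_markers.values()) + amelogenin + n_snps
    for mix, n_snps in snps_per_mix.items()
}
print(totals)
```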


NGS: from diagnostics to forensic applications 


Next-generation sequencing (NGS) has enabled scientists and clinicians to better under- 
stand the genetic mechanisms associated with disease conditions that lead to altered health 
status and the development of serious diseases such as cancer and infectious illnesses. To 
date, NGS has become essential in application areas such as medical diagnostics, forensic 
genetics, biotechnology, and virology. 


These advances have resulted in improved diagnostics in the clinic for early interven- 
tion and monitoring of treatment response, ensuring patients receive the best possible 
therapies. 

Use of NGS in clinical diagnosis is now widely accepted, with varying roles from gene 
panel or targeted sequencing, through whole exome sequencing (also termed clinical exome 
sequencing), to whole genome sequencing. NGS was first applied to clinical diagnosis when 
clinical exome sequencing was launched in 2012. The neurology community, although 
recognizing its technical limitations, including nonuniform coverage and the difficulty in 
sequencing or analyzing certain genomic regions and lack of ease in detecting certain 
types of variants, still believed the new technology had benefits to offer. The first scien- 
tific reports showed that clinical exome sequencing was able not only to successfully identify 
both rare and common single nucleotide variants and small insertions and deletions across 
coding exons and flanking splice sites of genes, but also to achieve a higher diagnostic rate 
than most of the traditional molecular tests, such as single gene sequencing, small gene 
panels, or chromosomal microarrays for rare Mendelian disorders. 

Furthermore, diagnostic rates between laboratories were surprisingly consistent 
despite differences in the capture kits used, bioinformatics pipelines, and patient cohorts. 
Initially, variant confirmation was done with Sanger sequencing, which was, and still is in 
some estimations, considered the gold standard. However, laboratories soon began to 
develop internal quality measurements to determine the necessity of confirmation for 
each variant. 

We now know that there are genomic regions, and conditions in which variant abun- 
dance is relevant to disease, for which NGS is a better method than Sanger sequencing, 
raising the question of whether Sanger sequencing should be replaced as the gold stan- 
dard (Haddrill, 2021). 

Additionally, NGS methods effectively allow all types 
of nucleic acids to be sequenced, using a whole genome or targeted approach, with 
DNA, mRNA, and small RNA sequencing as standard analyses. Furthermore, larger-scale 
sequencing of specific RNA subtypes, such as long noncoding RNAs and snoRNA, 
as well as methylated DNA, has become possible with the introduction of NGS. The 
prospect of simultaneously analyzing a large number of markers such as STRs and SNPs 
in parallel with targeted mRNA and small RNA analysis makes MPS a very powerful, 
relatively easily applicable tool in forensic laboratories (Fig. 24.2) (Ballard et al., 2020). 

In particular, the relevance to forensic analysis is that, in addition to generating standard 
STR profiles, DNA repeats can also be sequenced to look for single nucleotide polymorphisms 
(SNPs). 

Moreover, additional SNPs can be sequenced to acquire ancestry, paternity, or 
phenotype information. 

Current MPS systems are also very useful in cases where DNA has been acquired 
from a crime scene and so may be available in limited quantities or present in a highly 
degraded form. If, for example, not enough autosomal DNA is present and a lineage 
analysis needs to be performed, mitochondrial DNA can be sequenced, via NGS, for 
maternal lineage analysis. The application of MPS technology also extends to other 
markers such as DNA methylation patterns and mRNA gene expression for identification 
of body fluids. 

[Figure: general steps in DNA mixture interpretation using a probabilistic genotyping software (PGS) system: mixture occurs; sample collected; data obtained (extraction, quantitation, PCR, EPG with STR profile); PGS inputs (allele frequencies, number of contributors, statistical models); list of weighted genotype possibilities produced from mixture deconvolution; likelihood ratio (LR) calculated; report generated; testimony offered; trier-of-fact decision made.] 

Figure 24.2 Most prevalent applications of nucleic acid analysis in forensic testing. (From Ballard, D., 
Winkler-Galicki, J., Wesoły, J. (2020). Massive parallel sequencing in forensics: advantages, issues, technicalities, 
and prospects. International Journal of Legal Medicine, 134(4), 1291-1303. doi: 10.1007/s00414-020-02294-0. 
Epub 2020 May 25. PMID: 32451905; PMCID: PMC7295846.) 

This wide variety of markers is useful for forensic DNA phenotyping (FDP) 
and DNA-based prediction of appearance (eye, hair, and skin color), biogeographic 
ancestry, and age. For this reason, the “Visible Attributes through Genomics - VISAGE” 
(Gross et al., 2021) consortium is developing MPS tools for FDP marker analysis to 
enable prediction of appearance, ancestry, and age of unidentified contributors from bio- 
logical stains. These tools are designed to allow forensic laboratories to implement these 
innovative approaches into their repertory of routine applications for studying case 
samples. 

These developments clearly demonstrate that NGS will become an indispensable 
tool for forensic science. Nevertheless, distinguishing the various components present in a mixture and 
assigning an appropriate weight to the evidence can be challenging (Butler & Willis, 
2020). The complexity of mixed profiles has called for increasingly complex methods 
of mixture interpretation, and there has been a move away from relatively simple 
methods that ranged from determining whether an individual could be excluded as a po- 
tential contributor to a mixture, to the use of likelihood ratio methods that estimated the 
most likely genotype combinations of contributors to a mixture, the more complex of 
which used some of the information contained within profile peak heights (Haddrill, 
2021). 

This has led over the years to an increase in the use of probabilistic genotyping software 
(PGS) to assist DNA mixture interpretation. Generally, PGS systems use either: (1) 
“discrete” (sometimes called “semi-continuous”) models that use the presence or 
absence of peaks along with probabilities of allele drop-out or drop-in or (2) “contin- 
uous” (sometimes called “fully-continuous”) models that take peak heights into ac- 
count as well as the presence or absence of peaks along with probabilities of allele 
drop-out or drop-in. Fig. 24.2 describes the general steps in mixture interpretation 
along with user inputs required for PGS systems. The ability of these programs to anal- 
yse mixtures previously considered too complicated for interpretation has seen rapid 
uptake by forensic laboratories, and publication of studies reporting the developmental 
and internal validation of different probabilistic genotyping software packages, as well as 
guidelines for their use by a number of regulating bodies (Forensic Science Regulator, 
2020; Haddrill, 2021). The software packages that implement probabilistic genotyping 
methods are highly complex, and developers have urged forensic laboratories to ensure 
their analysts have a good understanding of the concepts underlying the methods and 
that they remain involved in the interpretation of profiles and critical evaluation of 
the mixture analysis (Lee et al., 2019). Concerns have been raised over variation in 
the output of probabilistic genotyping methods, some due to subjective decisions made
by the user, some due to variability inherent in the methods (Haddrill, 2021). Some 
countries have seen extensive debates over the admissibility of probabilistic genotyping 
methods in court and whether methods have gained general acceptance in the commu- 
nity, but the widespread implementation of these methods into forensic laboratories 
around the world suggests that the response has been positive. These methods also pro- 
vide significant promise for the interpretation of MPS data that can uncover greater 
complexity in mixed profiles by identifying sequence differences between alleles that 
would be indistinguishable by length. 
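
As an illustration of the "discrete" approach described above, the following minimal sketch
computes a likelihood ratio for a single locus and a single contributor, using only the
presence or absence of alleles together with drop-out and drop-in probabilities. It is a toy
model under stated assumptions and does not reproduce any particular PGS package: the allele
frequencies, the drop-out and drop-in values, and all function names are hypothetical, and a
homozygote is assumed to drop out only if both copies drop out independently.

```python
from itertools import combinations_with_replacement


def prob_evidence_given_genotype(evidence, genotype, freqs, dropout, dropin):
    """P(observed allele set | donor genotype) under a toy discrete model."""
    p = 1.0
    for allele in set(genotype):
        copies = genotype.count(allele)
        # Assumption: each copy drops out independently with probability `dropout`.
        p_all_copies_drop = dropout ** copies
        p *= (1.0 - p_all_copies_drop) if allele in evidence else p_all_copies_drop
    unexplained = set(evidence) - set(genotype)
    if unexplained:
        # Every allele not carried by the donor must be explained by drop-in.
        for allele in unexplained:
            p *= dropin * freqs[allele]
    else:
        p *= 1.0 - dropin
    return p


def genotype_frequency(genotype, freqs):
    """Hardy-Weinberg genotype frequency for an unordered pair of alleles."""
    a, b = genotype
    return freqs[a] ** 2 if a == b else 2.0 * freqs[a] * freqs[b]


def likelihood_ratio(evidence, suspect, freqs, dropout, dropin):
    """LR = P(E | suspect is the donor) / P(E | an unknown person is the donor)."""
    numerator = prob_evidence_given_genotype(evidence, suspect, freqs, dropout, dropin)
    denominator = 0.0
    for genotype in combinations_with_replacement(sorted(freqs), 2):
        denominator += genotype_frequency(genotype, freqs) * prob_evidence_given_genotype(
            evidence, genotype, freqs, dropout, dropin
        )
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical allele frequencies at one locus (illustrative values only).
    example_freqs = {"12": 0.2, "14": 0.3, "16": 0.5}
    print(likelihood_ratio({"12", "14"}, ("12", "14"), example_freqs, 0.1, 0.05))
```

A "continuous" model would additionally weight each genotype combination by how well it
explains the observed peak heights (or, for MPS data, read counts), which this sketch
deliberately leaves out.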


NGS in forensics and the issue of privacy: a global overview 


The development over the years of increasingly cutting-edge NGS techniques has made 
forensic DNA testing a significant resource for investigation and prosecution activities in 
criminal justice systems around the world. Forensic DNA testing can be conducted in 
several ways: first, by comparing the DNA profiles from criminal suspects to DNA evi- 
dence, so as to assess the likelihood of their involvement in a crime. The second kind of 
use is related to searching for a link between the biological material collected from a 
crime scene and a DNA profile stored in a criminal DNA database. The third form of 
forensic DNA testing is related to procedures to search for criminal suspects through their 
connection with biological relatives. Finally, it can be used to infer externally visible
physical features from a biological sample collected at the crime scene (Toom, 2018).
One prominent aspect of forensic DNA testing is the establishment and expansion of 
centralized national criminal DNA databases. These databases involve the collection, 
storage, and use of DNA profiles from nominated suspects, convicted offenders, victims, 
volunteers, and other persons of interest to criminal investigation work. The primary 
function of a criminal DNA database is to produce matches between individual profiles 
and crime scene stains, which requires a constant input of both. Around 69 countries 
currently operate national forensic DNA databases, and others are being expanded or 
established in at least 34 additional countries (Interpol, 2016). Recent innovations and 
developments in forensic DNA testing in the criminal field are related to techniques 
of Forensic DNA Phenotyping (FDP), the use of ancestry-informative markers, and familial
searching. FDP can be described as a set of techniques that aim to infer human
externally visible physical features—eye, hair, and skin color—and continental-based 
biogeographical ancestry of criminal suspects on the basis of analysis of biological mate- 
rials collected at crime scenes. FDP techniques have been applied in various jurisdictions 
in a limited number of high-profile cases to provide intelligence for criminal investiga- 
tion. Familial searching makes use of procedures to detect genetic relatedness in criminal 
DNA databases to search for criminal suspects through their connection with biological 
relatives (Machado & Silva, 2019). 
Massively parallel sequencing (MPS) makes possible, in one DNA reaction, the
collection of autosomal STRs, Y-STRs, mitochondrial markers, and mRNA as well as 
a variety of EVC (externally visible characteristics) and AIM (ancestry informative 
markers) relevant SNPs and “Indels” (or insertions/deletions). More SNPs may also be 
included in the arrays that comprise commercially available MPS kits. It is possible to 
run reactions for STR identification separately, but it is also possible to use the whole 
array if required. Genealogical connections are already routinely interrogated by the use of 
Y-STRs and mitochondrial DNA, but the inclusion of these in MPS kits may make their 
uses more common, although the privacy implications of this increase are insignificant 
when confined to the comparison of DNA collected in the course of criminal investigations.
Additional considerations may arise if crime-scene DNA is searched against
commercially managed “recreational” or research genealogical databases, such as
GEDmatch or FamilyTreeDNA (Williams & Wienroth, 2017).

These considerations arise because DNA is sensitive data. To guarantee and protect every
citizen’s right to privacy, the police cannot search the databases held by genealogy
companies on their own initiative unless the citizens who contributed their DNA to the
database have consented to its use in this specific context.

For this reason, a compromise bill was reached in Maryland that seeks to balance the value
of this technology for identifying people who commit serious crimes against the related
privacy concerns. The bill constrains the use of forensic genealogy in investigations in
several ways. First, it states that by 2024, genealogists working on such cases must be
professionally certified. Second, investigators can only use genealogy companies that have
explicitly informed the public and their clients that law enforcement uses their databases
and have asked for their clients’ consent to participate. Currently, GEDmatch and
FamilyTreeDNA clients can choose to participate in these searches. Finally, investigators
may not use any of the genetic information collected, whether from the suspect or a third
party, to learn about a person’s psychological traits or predispositions to disease. Upon
completion of the investigation, all genetic and genealogical records that were created for
it must be deleted from the databases. Forensic genealogy has thus demonstrated, on the one
hand, its effectiveness in providing valuable information for investigations and for solving
cases and, on the other hand, its strong ethical and legal implications related to privacy.

Different views on the capabilities, benefits, and risks of forensic DNA testing circu- 
late within modern societies. Supporters of the expansion of forensic DNA testing in the 
criminal justice system invoke its capacity to serve as a valuable law enforcement tool, 
namely by improving efficiency in fighting crime, helping in the prevention of miscar- 
riages of justice and deterrence of criminal activity, which is, in turn, expected to reduce 
crime and increase public safety and security. Critics concerned with potential threats to 
civil liberties argue that forensic DNA testing, in particular the storage of profiles in 
computerized databases operating as forensic DNA databases for criminal identification, 
may threaten the protection of a range of human rights, in particular liberty, autonomy,
privacy, informed consent, moral and physical integrity, and the presumption of innocence
(Machado & Silva, 2019).


Applications of NGS technology in forensic entomology 


Forensic entomology is a branch of the forensic sciences focused on the application of the 
study of arthropods for legal investigations (Byrd & Tomberlin, 2019); medicolegal ento- 
mology, more specifically, focuses on the study of those entomological groups which are 
attracted to decomposing matter (such as a dead body) and which can provide informa- 
tion useful to the investigators, relative, for example, to postmortem movement of the 
remains or minimum time since death (Amendt et al., 2011). For the purpose of this 
chapter, we will focus on the insects of medicolegal interest. 

During a criminal investigation that involves entomological evidence, the specimens 
have to be collected and preserved promptly and correctly, to ensure a correct assessment 
of their developmental stage, as well as to preserve their morphology. This will make it 
possible to obtain a correct identification to the lowest taxonomical level of the specimen 
(ideally species). Because insects are poikilothermic, their developmental rate depends
on the external temperature (Charabidze & Hedouin, 2019), and it is species-specific. 
This means that different species develop at different rates at a given temperature. There- 
fore, to appropriately use insects as indicators of time of colonization (TOC) or minimum 
postmortem interval (mPMI) (Tomberlin et al., 2011) the determination of the species is 
essential. Morphological identification is a common method of identifying specimens
based on their observable morphological features; this is accomplished through the use 
of dichotomous keys. Dichotomous keys consist of a specific sequence of couplets refer- 
ring to a specific morphological trait, which offers two contrasting options; as one pro- 
ceeds through the couplets, choosing options that apply to the specimen of interest, the 
specimens are identified through a process of elimination (Charabidze & Martin-Vega, 
2021). For some specimens however, morphological identification is not a suitable op- 
tion because of their conditions (damaged specimens) or because dichotomous keys 
are not available (Gemmellaro et al., 2019). In those cases, identification has to be accom- 
plished in different ways. 
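
The process of elimination encoded by a dichotomous key can be sketched as a small binary
decision structure. The example below is only an illustration: the trait names and taxon
labels are hypothetical placeholders rather than entries from any published key, and the
class and function names are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass
class Couplet:
    """One couplet of a key: a trait with two contrasting outcomes."""
    trait: str
    if_present: Union["Couplet", str]  # next couplet, or a taxon name
    if_absent: Union["Couplet", str]


def identify(key: Union[Couplet, str], observations: Dict[str, bool]) -> Optional[str]:
    """Walk the couplets until a taxon name is reached.

    Returns None when a required trait could not be scored, which mirrors the
    situation of a damaged specimen that cannot be keyed out morphologically.
    """
    node = key
    while isinstance(node, Couplet):
        if node.trait not in observations:
            return None
        node = node.if_present if observations[node.trait] else node.if_absent
    return node


if __name__ == "__main__":
    # Hypothetical two-couplet key with placeholder traits and taxa.
    toy_key = Couplet("trait_1", Couplet("trait_2", "taxon A", "taxon B"), "taxon C")
    print(identify(toy_key, {"trait_1": True, "trait_2": False}))
```

When a trait is missing from the observations, the walk stops without an identification,
which is the point at which the alternative methods discussed next become necessary.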

In the field of forensic entomology, molecular biology has mainly been employed to
identify entomological specimens (Stevens et al., 2019). The group to which molecular
identification has been most widely applied is the blowfly family (Diptera: Calliphoridae).
Blowflies are among the first colonizers of decomposing remains; the adults can locate
a corpse within minutes, and once they arrive, they lay their eggs on it.
When the eggs hatch, the larvae start feeding voraciously on the body
(Greenberg & Kunich, 2002). DNA barcoding is a method that consists of sequencing a
short fragment of DNA and using this fragment to identify the species it belongs to
by comparing it against known sequences stored in DNA libraries. While the term
barcoding refers to the process of sequencing a single specimen at a time, metabarcoding
refers to massively parallel sequencing of samples containing more than one individual’s
DNA (Cristescu, 2014). This method has been used extensively for the identification of 
specimens of forensic interest. 

It is possible to use different sequences to identify the species of a specimen; the ITS2
(Internal Transcribed Spacer 2) sequence is one example. This is a spacer sequence located
between two subunits of ribosomal RNA (rRNA), and it has been widely used in taxonomy
and phylogeny for several taxa, including insects of forensic interest (Gemmellaro et al., 
2019). However, most of the molecular identification methods used in forensic ento- 
mology for the identification of the specimens focus on the COI + II (Cytochrome c 
Oxidase subunits I and II) mitochondrial genes; these are relatively stable haploid genes 
whose sequence (entirely or partially) has been used for the species identification of cal- 
liphorid flies (Stevens et al., 2019). The BOLD database (Barcode of Life Data System) is 
a free public resource where barcode COI sequences of known species are stored 
(Ratnasingham & Hebert, 2007). As of 2022, the BOLD database contained
12,126 records for Calliphoridae, representing a total of 328 different species
(https://www.boldsystems.org). Potentially, a specimen recovered at a crime scene that
cannot be identified morphologically could undergo DNA isolation, PCR, purification, and
sequencing of the COI region. The sequence could then be submitted to the BOLD
database, where it would be compared against the known sequences of the library, and
an identification to the lowest taxonomical level could be possible. 
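
The comparison step described above can be sketched as a search for the closest reference
sequence. This is a deliberately simplified illustration and not the method used by BOLD:
it assumes that the query and reference barcodes are already aligned to the same length, the
sequences and species labels are invented placeholders, and the 97% identity threshold is an
arbitrary value chosen for the example.

```python
def percent_identity(a, b):
    """Percentage of matching positions between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def best_match(query, library, threshold=97.0):
    """Return (name, score) of the closest reference, or (None, score) below threshold."""
    scored = [(percent_identity(query, seq), name) for name, seq in library.items()]
    score, name = max(scored)
    return (name, score) if score >= threshold else (None, score)


if __name__ == "__main__":
    # Invented 20-base "barcodes"; real COI barcodes are several hundred bases long.
    toy_library = {
        "species_X": "ACGTACGTACGTACGTACGT",
        "species_Y": "ACGTTCGTACGAACGTACGT",
    }
    print(best_match("ACGTACGTACGTACGTACGT", toy_library))
```

In a metabarcoding setting, the same lookup would be repeated for every distinct sequence
recovered from the mixed sample, giving a list of the taxa present rather than a single
identification.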

The revolution represented by NGS has had a significant impact even in the field
of forensic entomology. NGS has allowed for the sequencing of genomic DNA without
the need to know the genome of a specific species. Through the use of NGS, it is possible
to use DNA fragments to identify fragments of specimens, specimens that have been
damaged by the weather conditions to which a body may have been exposed, or even specimens
that are too small to be seen and collected (Bittleston et al., 2016). It is common to 
recover remains colonized by masses of arthropods that cannot be easily separated into 
individual specimens; depending on the conditions of the remains, there may be just frag- 
ments of the arthropods (Chimeno et al., 2019). Without a metabarcoding analysis, the 
identification of these specimens would likely not be possible. 

Another application of NGS in forensic entomology was presented by Farncombe 
et al. (2014), and it involves the use of NGS to screen a partial genomic sequence of 
the black blow fly (Phormia regina Meigen (Diptera: Calliphoridae)) in search of
microsatellite loci; the aim was to assess the genetic structure of the population and to use
this information to infer corpse movements based on the genotypes of the maggots
collected on them.

A particular case where NGS analysis was used in forensic entomology but not for the 
determination of the arthropod species was presented by Pilli et al. (2016). The authors 
used NGS techniques on human blood extracted from the gastrointestinal tract of lice
and were able to build a clean human profile that was a perfect match with their reference 
sample. One of the newest applications of NGS to forensic investigations involving
entomofauna is the estimation of the postmortem submersion interval
(PMSI), the time elapsed between the submersion of a body and its recovery. Hyun
et al. (2019) analysed the microeukaryotic biodiversity of a pig carcass and a car bonnet
after their submersion in a reservoir; they compared the relative abundance pattern of
certain taxa during each of the decomposition stages of the pig to assess
differences that could potentially be used as references to infer the PMSI.
Introduction 


Next-generation sequencing (NGS), also known as massively parallel sequencing (MPS), 
is a high-throughput DNA sequencing technology capable of processing millions of 
DNA strands in parallel, yielding a higher resolution than capillary electrophoresis 
(CE), the current gold standard method used for human identification. NGS technology 
has advanced rapidly, and it is having a major impact in different areas, including the
forensic field. NGS is a robust method that allows sequencing of all types of nucleic acids 
and simultaneous analysis of hundreds of genetic markers, such as short tandem repeats 
(STRs) and single nucleotide polymorphisms (SNPs), and it provides more results in
less time (Ballard et al., 2020). The advent of NGS technology presents forensic
DNA laboratories with the opportunity to solve challenging cases involving degraded 
samples, complex mixtures, or cases with a limited amount of DNA. Also, forensic sci- 
entists are able to simultaneously obtain information on human identification, ancestral 
origin, lineage and kinship, and phenotypic traits (Bruijns et al., 2018). NGS is revolu- 
tionizing DNA analysis and is becoming a powerful tool for the forensic community, 
providing meaningful information and resolving challenges in forensic DNA casework. 

For over 20 years, forensic DNA testing has been based on CE technology, which 
presents limitations when dealing with difficult samples. Often, limited amounts or 
degraded samples pose challenges for CE-based testing, which may lead to inconclusive 
results. NGS offers forensic laboratories a new technology to add to their list of available 
testing methods. The ability to obtain additional information from smaller amounts of 
DNA, as well as from degraded and mixed DNA samples, positions NGS as an indispens- 
able tool for forensic DNA analysis. 

Despite the advantages presented by this new technology, forensic science has been
slower than other fields to fully embrace and incorporate NGS. This is
mainly due to the new workflow, which differs from CE and requires specialized 
training and validation of the methods for casework. The introduction of NGS tech- 
nology in a forensic laboratory requires changes in sample processing, instrumentation, 
data analysis, and interpretation. Generally, forensic laboratories have limited resources 
available, and the combination of a large upfront cost of instrumentation with
increased time and effort for validation and analyst training may be creating a road- 
block for the swift adoption of NGS. Alonso and coauthors (Alonso et al., 2017) pub- 
lished the results of a survey conducted by the DNASeqEx consortium in collaboration 
with the European Network of Forensic Science Institutes (ENFSI) DNA working 
group that collected information on NGS from 33 European laboratories from 25 
countries. Of these laboratories, 52% had purchased at least one NGS instrument by 
2017. The ones that had not purchased any instruments listed the following reasons 
for not doing so: general costs of NGS, lack of funding, and hesitation to use NGS until 
applications were more developed. For those that purchased an instrument, implemen- 
tation challenges were reported, such as a lack of consistent nomenclature and reporting 
standards, a lack of compatibility with existing national DNA database infrastructure, a 
lack of population data for statistical purposes, and a lack of an adequate legislative 
framework (Alonso et al., 2017). 

To accelerate the adoption of NGS technology by forensic laboratories, companies 
manufacturing kits and instrumentation have worked on lowering costs. Also, there 
are now more peer-reviewed publications on NGS-STR population data (Casals 
et al., 2017; Gettings et al., 2018; Hussing et al., 2019; Moura-Neto et al., 2021; 
Novroski et al., 2016; Phillips, Devesse, et al., 2018; Silva et al., 2018, 2020; van der 
Gaag et al., 2016; Wu et al., 2019), NGS testing, and validation work (Brandhagen 
et al., 2020; Buchard et al., 2016; Cihlar et al., 2020; Faccinetto et al., 2021; Hollard 
et al., 2019; Jager et al., 2017; Kocher et al., 2018; Senst et al., 2022; Silvery et al., 
2020). In 2016, the International Society for Forensic Genetics (ISFG) addressed the issue 
of the compatibility between sequence-based STR data and CE-generated genotypes 
and the need for a consistent and standardized nomenclature framework for NGS- 
STR sequences (Parson et al., 2016). Two years later, the DNA Commission of the 
ISFG revised the minimal STR sequence nomenclature requirements and published a 
revised sequence template file and nomenclature format (Phillips, Gettings, et al., 
2018). More recently, the STRAND (STR: Align, Name, Define) ISFG working group 
held a meeting to address the issue of a lack of a universally accepted nomenclature sys- 
tem. Attendees discussed ideas with the goal of progressing on a nomenclature scheme, 
and the summary of the meeting report was published in September 2019 (Gettings et al., 
2019). The ultimate goal of this working group is the release of an official ISFG recom- 
mendation on the requirements for labeling STR sequences. Other developments to 
encourage laboratories to adopt and validate NGS for DNA casework were the publica- 
tion of the revised Scientific Working Group on DNA Analysis Methods (SWGDAM) 
validation guidelines for DNA analysis methods (Scientific Working Group for DNA 
Analysis Methods (SWGDAM), 2016) and the addendum to the “SWGDAM interpretation
guidelines for autosomal STR typing by forensic DNA testing laboratories” to address
NGS (Scientific Working Group for DNA Analysis Methods (SWGDAM), 2019a).


Validation of NGS for casework at forensic DNA laboratories 


NGS kits and platforms for generating DNA profiles 


Law enforcement agencies around the world have access to DNA profiles stored in na- 
tional and international DNA databases, which usually contain information on STR loci 
and mitochondrial DNA (mtDNA). In the US, for example, the National DNA Index 
System (NDIS) accepts autosomal and Y chromosome STR profiles and mtDNA control 
region data (Federal Bureau of Investigation (FBI), n.d.). Currently, NGS kits can 
generate DNA profiles based on data for generally accepted DNA loci and also on locations
that are not usually encountered in DNA databases, such as X-STRs and SNPs.

The ForenSeq DNA Signature Prep kit (Verogen, CA) was the first NGS kit 
commercially available for forensic purposes and also the first NGS-STR kit approved 
for upload to NDIS. This kit is part of an integrated workflow that includes the MiSeq 
FGx sequencing instrument platform and the ForenSeq Universal Analysis Software. 
With this system, investigators can generate DNA profiles for up to 96 DNA samples 
on a single sequencing run. The results include information on autosomal, X-, and Y- 
STRs, as well as SNPs for the purposes of identification and phenotypic and biogeo- 
graphical ancestry information (Verogen, n.d.a). Other ForenSeq kits are also available 
for forensic purposes and are also used with the MiSeq FGx sequencing instrument plat- 
form and the ForenSeq Universal Analysis Software. These kits include the ForenSeq 
mtDNA Whole Genome Kit, for sequencing the whole mitochondrial genome (Vero- 
gen, n.d.e); the ForenSeq mtDNA Control Region Kit, for sequencing the control re- 
gion of the mitochondrial genome (Verogen, n.d.d); the ForenSeq Kintelligence Kit, for 
forensic genetic genealogy purposes (Verogen, n.d.b), and the ForenSeq MainstAY Kit, 
which targets core autosomal and Y-STRs (Verogen, n.d.c). 

The PowerSeq CRM Nested System (Promega, WI) was the first NGS-mtDNA kit 
approved for upload to NDIS. This system generates information on mitochondrial HVI, 
HVII, and HVIII control regions, and it is part of a workflow compatible with MiSeq 
technologies (Illumina, CA) and the GeneMarker HTS Software (SoftGenetics, LLC) 
(Promega, n.d.b). Another PowerSeq product is the PowerSeq 46GY System, a kit that is not
yet NDIS-approved and that targets autosomal and Y-STR loci (Promega, n.d.a).

The Applied Biosystems Precision ID mtDNA Whole Genome Panel (Thermo 
Fisher, MA) is also NDIS-approved. However, only the control region results are 
allowed to be uploaded to the database. This NGS panel is part of a system that in- 
cludes the Ion GeneStudio S5 System, the Ion Chef System, and the Converge Soft- 
ware with the NGS data analysis module. The Precision ID mtDNA Control Region 
Panel, which targets hypervariable (HV) regions I, II, and III, is also available. Other 
panels include the Precision ID Ancestry Panel containing SNPs that can provide 
biogeographic ancestry information; the Precision ID Identity Panel, which includes 
34 upper Y-clade SNPs and 90 autosomal SNPs; and the Precision ID GlobalFiler
NGS STR Panel containing autosomal STRs and Y markers (Thermo Fisher Scientific
Applied Biosystems, n.d.). It is worth noting that these last four cited kits are not NDIS- 
approved at this moment. 


Milestones for NGS in forensic casework 


As forensic laboratories gain more exposure to NGS and become better acquainted with
its advantages, some important milestones are being set by different laboratories around 
the world that are implementing parallel sequencing in their casework. France, for 
example, started to analyze DNA samples with NGS technology for mixture deconvolution,
analysis of complex kinship, and recovery of genetic profiles from highly
degraded DNA. In 2018, during the GCC Forensics Exhibition and Conference, Dr. 
Francois-Xavier Laurent presented how the Institut National de Police Scientifique 
(INPS), in France, implemented a fully automated workflow using NGS technology 
(Laurent, 2018). Dr. Laurent, at that time, was the head of research and development 
in forensic genetics at INPS, which is the biggest forensic institute in France and com- 
prises a network of five accredited forensic laboratories. Dr. Laurent presented a 2011 
cold case where a conclusive result could not be obtained with the use of CE results 
alone. The ForenSeq Prep Kit and the MiSeq FGx system (Verogen) were then used, 
and additional autosomal STR data was recovered for two low-level minor male contrib- 
utors. With this case, in February 2018, France became the first country to upload an 
NGS profile to a national DNA database. 

The use of NGS technology was crucial to solving a sexual assault case in The 
Netherlands in 2019. This was the first criminal conviction based on DNA profiles ob- 
tained with an NGS system (Verogen, n.d.f). In 2015, a 28-year-old woman was sexually 
assaulted. Evidence from the crime was submitted for analysis, and CE was used to 
generate DNA profiles. A hit was found in the Dutch convicted criminal database, 
and a suspect was arrested. The defense lawyer challenged the interpretation of the CE 
results, and the suspect was later released by the judge. The analyzed sample contained 
mixed DNA, and based on only CE results, it was not possible to determine if some al- 
leles were stutters from the victim’s profile or if they belonged to a minor contributor. 
After an appeal by the prosecution, the use of Verogen’s NGS technology was allowed, 
and it was performed by the Forensic Laboratory for DNA Research (FLDO), an inde- 
pendent forensic DNA laboratory at Leiden University Medical Center. This time, results 
were clear due to a great advantage of NGS: investigators not only had information on 
the size of each allele, but they could also analyze their sequences. They were able to 
identify more of the minor alleles, and after the use of likelihood ratio statistics, a conclu- 
sive result was obtained. After this first court case, FLDO, which was the first forensic 
laboratory to receive ISO/IEC 17025 accreditation for NGS technology, has used this
method in more than 30 cases. Investigators were able to generate profiles for some 
cold cases, and for other ones, NGS-generated profiles allowed potential suspects to 
be excluded. 
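
The advantage described in this case, that sequencing reveals differences hidden from a
purely length-based analysis, can be illustrated with a small sketch. The sequences below
are invented for the example and do not correspond to any real locus or to the case data;
the function names and the four-base motif length are likewise assumptions.

```python
def length_based_call(sequence, motif_length=4):
    """Allele designation as a length-based method would report it: number of repeat units."""
    return len(sequence) // motif_length


def count_distinct_alleles(reads, motif_length=4):
    """Return (distinct alleles by length, distinct alleles by sequence)."""
    by_length = {length_based_call(read, motif_length) for read in reads}
    by_sequence = set(reads)
    return len(by_length), len(by_sequence)


if __name__ == "__main__":
    # Invented repeat regions: the first two have the same length (10 units)
    # but differ by one base inside the repeat block.
    toy_reads = [
        "TCTA" * 10,
        "TCTA" * 4 + "TCTG" + "TCTA" * 5,
        "TCTA" * 12,
    ]
    print(count_distinct_alleles(toy_reads))
```

In this toy mixture, a length-based method sees two alleles, whereas the sequence data
separate three, which is the kind of additional resolution that can help attribute alleles
to a minor contributor.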


Recently, Finaughty and coauthors (Finaughty et al., 2020) reported a case from 2018 
where two sets of human remains (A and B) were discovered within 2 months of each 
other off the coast of Cape Town, South Africa. Both sets were unidentifiable due to ma- 
rine decomposition, and preliminary analysis by the forensic pathologist suggested that 
the remains of set A belonged to a female individual. DNA analysis was then performed 
to determine the biological sex of the remains and if sets A and B belonged to the same 
individual. DNA profiles were obtained using both CE and NGS analyses, and according 
to the results, the remains belonged to the same male individual and not a female, as pre- 
viously suggested. Although the results obtained using the conventional CE method and 
NGS were concordant, sequencing analysis provided higher random match probability 
values. According to the authors, this was the first forensic case in Africa where NGS 
and the MiSeq FGx system were successfully used and assessed. Also, this was the first 
report of successful DNA profiling with an NGS system using soft tissue lysates from ma- 
rine decomposition. 

In 2019, Verogen announced a partnership with Cellmark Forensic
Services, a provider of forensic DNA services in the United Kingdom (UK), to establish
Cellmark as a UK center of excellence for forensic NGS (Verogen, 2019a). As part of 
the new collaboration, Verogen would supply Cellmark with its ForenSeq DNA Signa- 
ture Prep Kit. The goal of this collaboration was to provide the capabilities of NGS to 
police investigators, such as simultaneous analysis of different DNA markers to generate 
data on identification, appearance, and biogeographical ancestry, and the potential to 
recover more information from degraded and mixed samples. Two years later, in 
September 2021, Cellmark became the first forensic laboratory to receive ISO/IEC 17025
accreditation from the UK Accreditation Service for forensic sequencing services (Cell- 
mark Forensic Services, n.d.). 

In the United States, important milestones were reached when the FBI 
approved forensic NGS systems to be used by laboratories for the generation of 
DNA profiles for NDIS. In May 2019, the MiSeq FGx system (Verogen) became 
the first NGS NDIS-approved system for STR profiling. This was achieved by the 
submission of internal validation results to the FBI by four different laboratories: 
the Washington, D.C., Department of Forensic Sciences (DFS), the Armed Forces 
DNA Identification Laboratory (AFDIL), the University of North Texas Health 
Science Center (UNTHSC), and the Ohio Bureau of Criminal Investigation 
(OBCI) (Verogen, 2019b). Also in May 2019, the PowerSeq CRM Nested System 
(Promega), which is used for generating mtDNA profiles, was approved for submission 
to the US NDIS Combined DNA Index System (CODIS) database (Promega, n.d.b). 
In July 2019, a second mtDNA NGS system became NDIS-approved, Thermo Fisher 
Scientific’s Applied Biosystems Precision ID System (Thermo Fisher Scientific, n.d.). It 
is important to mention that the mtDNA profiles generated by both mtDNA kits can 
only be used for missing person-related searches at NDIS. 
Validation of NGS-based DNA analysis methods 


The use of CE-based forensic DNA typing is well established in forensic laboratories and 
widely accepted in court. Before the implementation of NGS technology into forensic 
practice and prior to its use in casework, its performance needs to be validated. Any 
new assay or method introduced in a forensic laboratory must be rigorously tested to
ensure that the procedure works properly and that the results obtained are in accordance 
with the intended purpose (Butler, 2012). Through validation studies, procedural limi- 
tations are defined, and the method is shown to be robust, repeatable, and reliable. 
Once the process of validation is completed, standard operating procedures (SOP) and 
interpretation guidelines are established, and the new method can be used routinely in 
the laboratory workflow. 

Method validation is required by the ISO/IEC 17025 accreditation, which establishes 
the requirements for laboratory competence to carry out tests and/or calibrations. How- 
ever, ISO/IEC 17025 only determines a general standard, and experts in the forensic field 
should give more detailed recommendations (European Network of Forensic Science 
Institutes (ENFSI), 2010). 

On July 1, 2020, the FBI published updated quality assurance standards (QAS) for 
forensic DNA testing laboratories. These standards must be followed by laboratories per- 
forming forensic DNA testing or utilizing CODIS to ensure the quality and integrity of 
the data generated by the laboratory (Federal Bureau of Investigation (FBI), 2020). 

The promising implementation of NGS into the forensic casework routine led to the 
formulation of additional validation guidelines by SWGDAM. SWGDAM is constituted 
of a group of invited guests, representing international organizations or laboratories, 
academia, and accrediting agencies. SWGDAM issues guidelines for implementation 
by crime labs and also provides recommendations to the FBI Director on quality assur- 
ance standards for forensic DNA analysis (Scientific Working Group for DNA Analysis 
Methods (SWGDAM), n.d.). It is important to point out that SWGDAM issues guide- 
lines, not minimum standards. In the case of a conflict between the FBI’s QAS and rec- 
ommendations given by SWGDAM, the QAS and the QAS Audit Documents have 
precedence over these guidelines (Scientific Working Group for DNA Analysis Methods 
(SWGDAM), 2016). 

There are typically two types of validation: developmental and internal. Develop- 
mental validation involves testing new DNA methods for use on forensic samples to 
determine their conditions and limitations. This type of validation is usually done by 
commercial manufacturers and large laboratories (Butler, 2012). It can also be performed 
by smaller laboratories in the following situations: (a) if involved in the development and 
testing of a new method with a commercial company; (b) if involved in developing and 
testing a new in-house method; or (c) if a previously validated method is significantly 
modified or applied to a new situation and needs revalidation. Internal validation is 
done to verify if established methods and procedures will work as expected in one’s 
own laboratory (Butler, 2012). This type of validation is usually done by smaller local and 
state forensic DNA laboratories before implementing a previously validated (by developmental 
validation) method or procedure (Butler, 2012; National Institute of Standards 
and Technology (NIST), n.d.; Scientific Working Group for DNA Analysis Methods 
(SWGDAM), 2016). 


National DNA Index System 


The National DNA Index System, best known by its acronym NDIS, is a database of 
DNA profiles obtained and uploaded by federal, state, and local participating forensic lab- 
oratories in the United States. The DNA Identification Act of 1994 allowed the estab- 
lishment of NDIS and provided the categories of information maintained in the 
database, as well as the requirements for quality assurance, privacy, and expungement 
that participating laboratories must follow (Federal Bureau of Investigation (FBI), 2021). 
Currently, the DNA data accepted by NDIS include autosomal and Y-chromosome short tandem 
repeats (STRs), and mitochondrial DNA (mtDNA). Effective May 1, 2019, the DNA 
data submitted to NDIS must be generated in accordance with SWGDAM interpretation 
and validation guidelines and must follow guidelines outlined in the Addendum to 
“SWGDAM interpretation guidelines for autosomal STR typing by forensic DNA testing 
laboratories” to address NGS (Scientific Working Group for DNA Analysis Methods 
(SWGDAM), 2019a), the SWGDAM interpretation guidelines for mitochondrial DNA 
analysis by forensic DNA testing laboratories (Scientific Working Group for DNA Analysis 
Methods (SWGDAM), 2019b), and the SWGDAM validation guidelines for DNA analysis 
methods (Scientific Working Group for DNA Analysis Methods (SWGDAM), 2016). 
The current NGS PCR kits accepted for use at NDIS are: 
• Verogen ForenSeq DNA Signature Prep Kit 
• Promega PowerSeq CRM Nested System 
• Thermo Fisher Scientific Precision ID mtDNA Whole Genome Panel (only the 
mtDNA control region data is approved for NDIS) 
• Verogen ForenSeq MainstAY Kit 


FBI’s quality assurance standards 


The validation studies described in the latest version of the QAS, in effect since July 2020, 
can be applied to the implementation of NGS systems (Federal Bureau of Investigation (FBI), 2020). 
Crime labs must validate all new NGS workflows, including modified portions of pre- 
viously validated and approved systems. 
According to QAS Standard 8.2, developmental validation requires (when applicable) 
the following set of studies to be performed: 
• Characterization of the genetic marker: it is important to understand and document 
locus characteristics of the genetic markers used, such as inheritance, mapping, 
method of detection, and polymorphism(s). 
• Species specificity: nonhuman studies are conducted to evaluate if nonhuman DNA 
can interfere with the results of the analysis of forensic samples. The assay is not 
necessarily invalidated if genetic information is recovered from these nonhuman samples, 
but the data obtained should be used to define the limits of the assay. 
• Sensitivity studies: it is important to determine the maximum and minimum DNA 
quantities necessary to produce reliable results. 
• Stability studies: known samples are deposited on various substrates and are also put 
under different types of insults, such as environmental and chemical, to evaluate 
the effects of such conditions on the obtained results. 
• Case-type samples: mock or real forensic samples (usually from closed cases) are used 
to demonstrate reliable results with casework samples. 
• Database-type samples: for this type of study, the sample types and/or sample 
substrates tested are those routinely submitted to the database laboratory. 
• Population studies: it is important to determine allele frequencies for major 
population groups to use in the calculation of random match probability. Tests for 
independence expectations, such as Hardy-Weinberg equilibrium and linkage equilibrium, 
must be performed. 
• Mixture studies: it is important to determine the limit of detection when working 
with mixed specimens, especially in the detection of major and minor components 
in samples with more than one contributor. Tested mixed DNA samples must be 
representative of those typically encountered by the testing laboratory. 
• Precision and accuracy studies: repeatability and/or reproducibility should be 
addressed. This type of study will demonstrate how consistent the results obtained 
are and how close they match the actual values. 
• PCR-based studies: this type of study will demonstrate the reaction conditions 
needed to obtain reliable results and often include studies on thermal cycling 
conditions (including primer concentration, DNA polymerase, and other reagents 
necessary to the reaction), assessment of differential and preferential amplification, effects 
of multiplexing, assessment of appropriate controls, and product detection studies. 
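
The independence testing named under population studies can be illustrated with a small calculation. The sketch below estimates allele frequencies from a set of genotypes and computes a chi-square statistic comparing observed genotype counts with Hardy-Weinberg expectations. It is a minimal illustration, not a QAS-prescribed procedure: the function names and the example genotypes are hypothetical, and real population studies typically rely on exact tests and dedicated population genetics software.

```python
from collections import Counter
from itertools import combinations_with_replacement
from typing import Dict, List, Tuple

Genotype = Tuple[str, str]


def allele_frequencies(genotypes: List[Genotype]) -> Dict[str, float]:
    """Estimate allele frequencies by counting both alleles of every genotype."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def hwe_chi_square(genotypes: List[Genotype]) -> float:
    """Chi-square statistic of observed genotype counts vs. Hardy-Weinberg expectations."""
    n = len(genotypes)
    freqs = allele_frequencies(genotypes)
    observed = Counter(tuple(sorted(g)) for g in genotypes)
    chi2 = 0.0
    for a, b in combinations_with_replacement(sorted(freqs), 2):
        # Expected count: n * p^2 for homozygotes, n * 2pq for heterozygotes.
        proportion = freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b]
        expected = n * proportion
        chi2 += (observed.get((a, b), 0) - expected) ** 2 / expected
    return chi2


# Hypothetical genotype data for a single locus with two alleles.
in_equilibrium = [("22", "22")] * 25 + [("22", "23")] * 50 + [("23", "23")] * 25
no_heterozygotes = [("22", "22")] * 50 + [("23", "23")] * 50

print(hwe_chi_square(in_equilibrium))
print(hwe_chi_square(no_heterozygotes))
```

A statistic near zero indicates that the observed genotype counts match the expected ones, whereas a large value (as in the second data set, which has no heterozygotes) signals a departure from equilibrium.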
All developmental validation studies must be documented, and publication(s) supporting 
the underlying scientific principle(s) of a method must be available. This way, 
others can understand and assess the validation work that was performed. Also, 
peer-reviewed publications are encouraged due to their importance in the legal system. These 
publications show that the methodology or technology is generally accepted in the 
scientific field, and courts can decide whether or not the results obtained are admissible. 


Standard 8.3 describes the internal validation, which requires, as applicable, the 
following set of studies to be performed: 


• Known and nonprobative evidence samples or mock evidence samples: 
well-characterized known samples, such as NIST Standard Reference Materials, should 
be used to test the new method or technology to demonstrate that it is working as 
expected. Then, the following tests should be done using real case samples or mock 
case-type samples. 
• Known database-type samples: known samples, available database samples, or mock 
samples (these should reflect the type of samples encountered in databasing) should 
be tested and compared to previous results where possible for concordance. 
• Precision and accuracy 
• Sensitivity and stochastic studies: stochastic effects are common in PCR-based 
methods when working with low amounts of DNA. Therefore, it is important to 
test the limits of sensitivity to understand how it can affect your results and ensure 
accurate data. However, understanding just the lower limit is not enough. Testing the 
upper limitations is crucial to know the optimal range of the method. 
• Mixture studies 
• Contamination assessment studies: minimizing and assessing contamination is essential 
when working with forensic samples. Negative controls should always be included 
during extraction and amplification procedures to evaluate the detection of 
exogenous DNA. The results of this type of study should be used for the development 
of quality control procedures and interpretation guidelines. 


SWGDAM validation guidelines for DNA analysis methods 


The document providing guidelines for the validation of DNA analysis methods was 
revised in November 2016 to address NGS technologies, and it was approved and posted 
in December 2016 (Scientific Working Group for DNA Analysis Methods (SWGDAM), 
2016). It is important to know that this new document supersedes the SWGDAM revised 
validation guidelines published in 2012. 

Laboratories first determine the type of validation to conduct, then which studies to 
perform (depending on the methodology and its application), and finally how many 
samples are needed for each study in the validation plan. 


Developmental validation process 

Section 3 of the validation guidelines describes what shall be included, when applicable, 
in the developmental validation process. Most studies are concordant with the FBI’s 
QAS, except for one that addresses NGS technology specifically. The following are 
the studies listed in the developmental validation guidelines: 

• Characterization of the genetic marker (subsection 3.1) 
• Species specificity (subsection 3.2) 
• Sensitivity studies (subsection 3.3) 
• Stability studies (subsection 3.4) 
• Precision and accuracy (subsection 3.5) 
• Case-type samples (subsection 3.6) 
• Population studies (subsection 3.7) 
• Mixture studies (subsection 3.8) 
• PCR-based studies (subsection 3.9) 
• NGS-specific studies (subsection 3.10) 


Developmental validation process—subsection 3.10 

According to the updated guidelines, whenever a laboratory is validating an NGS 
method, they should: 
• Assess the effects of barcoding/indexing samples and subsequent bioinformatic sample 
separation. When appropriate, the effects of novel barcodes/indices that have not 
otherwise undergone developmental validation should be included (subsection 3.10.1). 
• Address the limit of detection as it relates to both starting DNA input as well as the 
extent of varying the quantity (extent of sample multiplexing in sequencing) and/or 
quality of libraries pooled in the sequencing reaction (subsection 3.10.2, sensitivity 
studies). 
• Assess NGS instrumentation for the possibility of signal cross-talk during sequencing 
and sample carryover between runs (subsection 3.10.3). 

In NGS methods, multiple libraries are pooled and sequenced together in a single run. 
This is possible because of the use of barcodes/indices, which are unique index sequences 
added to each library to distinguish between them during data analysis. However, 
sequencing or demultiplexing errors can occur, and one way to evaluate this is by using 
bioinformatic negatives, which are barcode/index combinations that were not assigned to 
any sample in the run but are still included in the analysis. Runs may also be performed 
back to back, making it important to use bioinformatic negatives to detect any carryover 
from one run to the next. For example, if a lab performs two sequencing runs, with the 
second starting right after the first is finished, the barcode combinations used should be 
different for each run. When the second run is analyzed, the barcode combinations used 
in the first run can then be included as bioinformatic negatives. If there is no carryover, 
contamination, or demultiplexing error, the reads assigned to the bioinformatic negatives 
should be very low and well below the assigned read threshold. 
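
The bioinformatic negative check described here amounts to a simple comparison of read counts against a threshold. The sketch below is one possible way to express it, assuming that demultiplexing has already produced a read count for every index combination. The index names, read counts, and the threshold of 650 reads are all hypothetical and would have to come from the laboratory's own validation data.

```python
from typing import Dict, List, Tuple

IndexPair = Tuple[str, str]


def flag_bioinformatic_negatives(
    read_counts: Dict[IndexPair, int],
    sample_indices: List[IndexPair],
    read_threshold: int,
) -> Dict[IndexPair, int]:
    """Return index combinations not assigned to a sample whose reads reach the threshold."""
    used = set(sample_indices)
    flagged = {}
    for pair, reads in read_counts.items():
        if pair not in used and reads >= read_threshold:
            flagged[pair] = reads
    return flagged


# Hypothetical demultiplexing output for the second of two back-to-back runs.
run2_counts = {
    ("i7_03", "i5_01"): 152_000,  # sample sequenced in run 2
    ("i7_04", "i5_01"): 148_500,  # sample sequenced in run 2
    ("i7_01", "i5_01"): 12,       # combination used only in run 1
    ("i7_02", "i5_01"): 940,      # combination used only in run 1
}
run2_samples = [("i7_03", "i5_01"), ("i7_04", "i5_01")]

flagged = flag_bioinformatic_negatives(run2_counts, run2_samples, read_threshold=650)
print(flagged)
```

In this made-up example, one index combination from the first run exceeds the threshold in the second run, which would prompt an investigation into carryover or demultiplexing errors.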

Also, samples available for analysis vary in quantity and quality. For this reason, during 
the validation process, it is important to sequence a small pool of samples in a single 
run that includes samples with higher DNA quantity and quality (such as positive controls), 
samples with low DNA input and/or lower quality, and negative controls. The 
results from this run can then be compared with those from a run consisting of a larger 
library pool. This type of study will verify the possibility of sample crosstalk and determine 
if the size and quality of the sample pool will influence the data recovered. 
Internal validation process 

Section 4 of the validation guidelines describes what shall be included, when applicable, 
in the internal validation process. Again, most studies are concordant with the FBI’s QAS. 
Developmental validation studies conducted within the same laboratory may satisfy some 
elements of the internal validation process. 

The following are the studies listed in the internal validation guidelines: 

• Known and nonprobative evidence samples or mock evidence samples (subsection 4.1) 
• Sensitivity and stochastic studies (subsection 4.2) 
• Precision and accuracy (subsection 4.3) 
• Mixture studies (subsection 4.4) 
• Contamination assessment (subsection 4.5) 
• NGS-specific studies (subsection 4.6): 4.6.1 Sensitivity studies should address the limit 
of detection as it relates to both starting DNA input as well as the extent of varying the 
quantity (extent of sample multiplexing in sequencing) and/or quality of libraries 
pooled in the sequencing reaction. 


Addendum to “SWGDAM interpretation guidelines for autosomal STR typing 
by forensic DNA testing laboratories” to address next-generation sequencing 
The SWGDAM interpretation guidelines document for autosomal STR typing, also 
mentioned in this chapter as the parent document, acts as a guide for laboratories in 
the interpretation of DNA typing results from STRs, and its latest version was approved 
on January 12, 2017, and revised on July 13, 2021 (Scientific Working Group for DNA 
Analysis Methods (SWGDAM), 2021). On April 23, 2019, an addendum providing 
guidelines for the interpretation of autosomal STR data developed via NGS was 
approved by the SWGDAM membership (Scientific Working Group for DNA Analysis 
Methods (SWGDAM), 2019a). The publication of this addendum is extremely impor- 
tant to assist forensic laboratories in the process of validation and implementation of 
NGS technology. Important guidelines are given on sequence-based allele designation, 
analytical and stochastic thresholds, mixture interpretation, and statistical analysis based 
on sequence data. According to SWGDAM, these guidelines are for laboratories that 
plan on using binary approaches to interpret sequence-based STR data. It is important 
to remember that sequence-based STR data can be converted to fragment length- 
based STR alleles, making it backward compatible with length-based data generated 
by conventional methods such as CE. 

Forensic laboratories are used to interpreting fragment length-based data and are 
familiar with terms such as peak, peak height ratio (PHR), and relative fluorescence units 
(RFUs). When working with NGS and interpreting data generated by this type of technology, 
laboratories should become familiar with the new terms. To facilitate data interpretation, 
a simplified way to understand the new terminology is to make the following 
correlations: 

• Peak = Signal 
• PHR = Allele count ratio (ACR) 
• RFU = Read count 

Section 1 of the addendum goes over the interpretation of DNA typing results. Un- 
like CE-generated data, where an electropherogram is created based on fluorescence in- 
tensity data and peaks are labeled by the software using allelic ladders as a reference, 
NGS-generated data need a software program to demultiplex the samples in the 
sequenced library, detect the sequence of nucleotides in each generated strand, and 
then convert the final information into sequence reads. For each locus, the reads are com- 
bined, and total sequence read counts are reported along with other allele-specific infor- 
mation, including each allele designation with its sequence, length, nomenclature, and 
read count. Similar to CE results, NGS final results need to be verified by a DNA analyst 
to ensure the accuracy of the data generated by the software. First, it is verified that the 
run performance and the quality of the data generated meet the established performance 
and quality metrics, which may include the amount of raw data and the quality of the 
generated libraries. Second, the reagent blank(s), positive control(s), and negative control(s) 
are verified to check if they present the correct genotyping results. Only then 
can the DNA analyst analyze each sample in the sequencing run. Also, if using multiple 
kits and/or platforms, the DNA analyst must check if the data generated is concordant 
between the overlapping loci of the different kits. 

The following are the main subsections that apply to the interpretation of NGS- 
generated autosomal data. Some subsections are only present in the parent document, 
and there are no NGS-specific guidelines in the addendum. 

Subsection 1.1. Analytical threshold (AT) is important to allele detection and should be 
set to distinguish between a real signal and background noise. ATs must be defined based 
on internally derived empirical data. For NGS methods, AT represents the minimum 
read count at or above which all detected signals are reliable and should be interpreted. 
There are multiple ways to set ATs, and they can be a fixed read count value or a per- 
centage. As referenced in the addendum document, this last type of AT is known as a 
percentage-based threshold, and it can be calculated by dividing the allele read count 
by the total locus read count. When choosing ATs, laboratories must take into consid- 
eration that nonreproducible noise may be detected above the set AT. To minimize 
this from happening, some labs might increase the AT value. However, it is important 
to take into consideration that raising the AT too much may result in data loss. The other 
extreme is also possible, and careful consideration must be taken not to set too low an AT 
value and run the risk of having erroneous information added to the results and/or having 
an overinterpretation of the data. Defining background noise can be challenging, and 
background noise levels may vary by sample and by locus within a sample. For this 
reason, different ATs can be set for different loci instead of using the same AT value 
for all loci analyzed. 
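
The two kinds of analytical threshold described in this subsection (a fixed read count and a percentage of the total locus read count) can be sketched as a filter over the read counts at a locus. The function name, the example read counts, and the threshold values below are hypothetical and chosen only to show how raising the AT can remove a low-level signal along with the noise.

```python
from typing import Dict, Optional


def apply_analytical_threshold(
    allele_reads: Dict[str, int],
    fixed_at: Optional[int] = None,
    percent_at: Optional[float] = None,
) -> Dict[str, int]:
    """Keep signals at or above a fixed read count and/or a fraction of the locus total."""
    total = sum(allele_reads.values())
    kept = {}
    for allele, reads in allele_reads.items():
        if fixed_at is not None and reads < fixed_at:
            continue
        # Percentage-based threshold: allele read count divided by total locus read count.
        if percent_at is not None and total > 0 and reads / total < percent_at:
            continue
        kept[allele] = reads
    return kept


# Hypothetical read counts for one locus; "artifact" stands for a noise sequence.
locus_reads = {"22": 35, "23": 395, "24": 410, "artifact": 6}

print(apply_analytical_threshold(locus_reads, fixed_at=10))
print(apply_analytical_threshold(locus_reads, percent_at=0.05))
```

With the fixed threshold of 10 reads only the 6-read artifact is removed, whereas the 5% threshold also removes the 35-read signal, illustrating the trade-off between suppressing noise and losing data.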

Subsection 1.2. The laboratory must evaluate the sequencing run based on previously 
established quality metrics. The lab can choose to evaluate the sequencing run based on 
results from positive controls and sequencing standards or based on run metrics defined 
during the validation process. The run metrics can include loading density, cluster den- 
sity, clusters passing the filter, phasing and prephasing values (percentage of molecules 
within a cluster that fall one base pair behind or ahead in each cycle), total reads per sam- 
ple, total reads per run, forward/reverse strand balance, Q-scores (estimate of the prob- 
ability of that base being called wrongly), etc. These metrics will only provide an 
overview of run performance, not the performance of each sample in the run. 
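
Q-scores follow the Phred convention, in which the score is a logarithmic transformation of the estimated probability that a base call is wrong: Q = -10 log10(P). The short sketch below converts between the two and computes the fraction of bases at or above a chosen cutoff. The function names and the example quality values are illustrative only.

```python
import math
from typing import List


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score to the probability of an incorrect base call."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


def fraction_at_or_above(qualities: List[int], cutoff: int) -> float:
    """Fraction of base calls whose quality score is at or above the cutoff."""
    return sum(q >= cutoff for q in qualities) / len(qualities)


print(phred_to_error_probability(30))   # Q30 corresponds to about 1 error in 1000 calls
print(error_probability_to_phred(0.01)) # 1 error in 100 calls corresponds to about Q20
print(fraction_at_or_above([30, 35, 20, 40], cutoff=30))
```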

Subsection 1.3. Controls must be included in each run, and laboratories must establish 
criteria on how to evaluate them. Controls may include reagent blanks and positive and 
negative controls. Additional controls may also be included for troubleshooting and qual- 
ity control purposes. 

Subsection 1.4. Locus designations and locus assignment for alleles must be done based 
on established criteria by the lab. Positive controls can be used for the verification of cor- 
rect designations. In this case, the lab needs to confirm that the software used for analysis 
has been validated for this purpose. Locus designations must include the range of the re- 
ported sequences, which should be based on a relevant human genome reference 
sequence (e.g., hg19 or GRCh38). 

Subsection 1.5. Allele designation must be done using length-based and/or sequence- 
based data. SWGDAM has yet to provide additional guidance on STR sequence nomen- 
clature but suggests that allele designation, as a numerical value or as a sequence, be done 
in accordance with the guidance provided by the ISFG (Parson et al., 2016; Phillips, Get- 
tings, et al., 2018). When the allele designation is sequence-based, there can be a variation 
in the sequence range reported depending on the NGS assay and/or analytical software 
used. The reported sequence may include the repeat region only or may also include 
additional information on the flanking region. When the allele designation is length-based, 
it should be done according to the fragment size and not just based on the number 
of repeat units in the core region. This method of designation should maximize concor- 
dance with CE methods, and the developmental validation of the analytical software is 
crucial to ensure this concordance between NGS and CE platforms. The sequencing 
of STRs has enabled the identification of isoalleles, which are alleles with the same 
size (length) but different sequences (Table 25.1). For this reason, laboratories must 
have established criteria to report isoalleles if sequence data is used for interpretation. 
However, even if the sequence data is mostly used, laboratories should search databases 
using the length data due to backward compatibility. Discrepancies between NGS-developed 
length-based and sequence-based alleles can arise due to flanking region 
insertions or deletions, for example. In these situations, laboratories should have an 
established policy for addressing and documenting these discrepancies, including, if possible, a 
reason for the nonconcordance. Also, there must be established guidelines for the 
designation of alleles containing an incomplete repeat motif or any reported sequence 
resulting in other than a whole-number allele. And finally, previously unobserved sequences 
and/or unusual motifs may be encountered with the sequencing of autosomal STRs. In 
the event of this type of situation, the following steps are suggested: Consider the data 
quality metrics, such as the ones described in subsection 1.2, before recognizing the 
new sequence or motif; perform a literature search to confirm that the encountered 
sequence or motif has not been previously published; resequence if the sequence or motif 
is not found in the literature; and consider if the unusual sequence has been properly 
designated by the software (length-based and sequence-based designations). 

Table 25.1 Representation of isoalleles in locus D12S391. Notice that all alleles have the same length 
but different sequences. 

Locus     Chromosome   Allele (length)   Bracketed repeat region 
D12S391 
D12S391 
D12S391 
D12S391 
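
The relationship between sequence-based alleles and isoalleles can be illustrated with a short grouping routine. The sketch below sums the repeat counts in a bracketed repeat string to obtain a length-based designation and then reports lengths that are represented by more than one sequence. It is a deliberate simplification: it assumes complete repeat units only and ignores flanking region insertions or deletions, so it does not replace fragment-size-based designation. The function names and the second 23-repeat sequence are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches blocks such as "[AGAT]13" in a bracketed repeat description.
REPEAT_BLOCK = re.compile(r"\[([ACGT]+)\](\d+)")


def repeat_units(bracketed: str) -> int:
    """Sum the repeat counts of all bracketed blocks (complete repeat units only)."""
    return sum(int(count) for _motif, count in REPEAT_BLOCK.findall(bracketed))


def group_isoalleles(sequences: List[str]) -> Dict[int, List[str]]:
    """Return repeat-unit lengths that are represented by more than one distinct sequence."""
    by_length = defaultdict(set)
    for seq in sequences:
        by_length[repeat_units(seq)].add(seq)
    return {length: sorted(seqs) for length, seqs in by_length.items() if len(seqs) > 1}


# Example sequences; "[AGAT]12 [AGAC]11" is a made-up isoallele for illustration.
observed = ["[AGAT]13 [AGAC]10", "[AGAT]12 [AGAC]11", "[AGAT]12 [AGAC]10"]

print(repeat_units("[AGAT]13 [AGAC]10"))
print(group_isoalleles(observed))
```

Two of the three sequences share a length of 23 repeat units and would be indistinguishable in a length-based search, which is why reporting criteria for isoalleles are needed when sequence data is used.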

Subsection 1.6. Nonallelic signals may be present in the final data and must be correctly 
identified. These signals may be PCR products, such as stutter, or analytical artifacts. 
Validation studies are important to aid in the identification of nonallelic signals. Specific 
empirical data for each NGS assay and detection system must be used to establish inter- 
pretation guidelines. Noise signals can be identified based on the sequence of the artifact 
and/or its reproducibility. A stutter signal can be identified if its sequence differs in 
length, or the number of repeat motifs, when compared to the parent allele, and if it ex- 
hibits a lower read count in relation to the authentic allele. An example can be found in 
Table 25.2. Well-defined thresholds based on validation data will aid in the identification 
of real alleles versus nonallelic signals and in determining what to expect with regard to 
the proportion of noise/stutter relative to a real allele. Because the proportion levels can 
vary with the increase or decrease of read counts, ATs that are locus-specific might be 
needed, as previously discussed in subsection 1.1. 


Table 25.2 Representation of an authentic allele and its sequence, length, and read count, and a 
stutter sequence at 8.9% of the parent allele signal. 


Locus     Chromosome   Allele (length)   Read count   Bracketed repeat region 

D12S391   12           22                35           [AGAT]12 [AGAC]10 
D12S391   12           23                395          [AGAT]13 [AGAC]10 
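
The 8.9% figure in Table 25.2 is the stutter read count divided by the parent allele read count (35 / 395). The sketch below reproduces that calculation and compares it with a maximum expected stutter ratio. The 15% cutoff is a hypothetical placeholder; actual expectations must come from locus-specific validation data.

```python
def stutter_ratio(stutter_reads: int, parent_reads: int) -> float:
    """Read count of the candidate stutter sequence relative to its parent allele."""
    if parent_reads <= 0:
        raise ValueError("parent allele read count must be positive")
    return stutter_reads / parent_reads


def is_consistent_with_stutter(stutter_reads: int, parent_reads: int, max_ratio: float) -> bool:
    """True if the candidate signal does not exceed the expected stutter proportion."""
    return stutter_ratio(stutter_reads, parent_reads) <= max_ratio


ratio = stutter_ratio(35, 395)
print(f"{ratio:.1%}")
print(is_consistent_with_stutter(35, 395, max_ratio=0.15))  # 0.15 is a hypothetical cutoff
```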
Subsection 1.7. Stochastic effects are common when amplifying low-level DNA samples. 
When working with CE methods, different peak heights may appear for a heterozygous 
locus, or an allele dropout can occur. For this reason, a specific peak height value 
is defined as a stochastic threshold. For NGS methods, instead of differences in peak 
heights, there can be differences in read counts, with the minor allele presenting a lower 
number of sequencing reads, leading to a heterozygous read count ratio imbalance or an 
allele dropout. A fixed read count stochastic threshold and a percentage-based threshold 
should be established and can be applied to all loci analyzed or be locus-specific. Stochas- 
tic thresholds must be based on empirical data and be specific to the amplification and 
detection systems used. Again, this shows the importance of validation studies and 
how the data obtained can aid in the determination of thresholds. 
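
The read count imbalance described in this subsection can be expressed as an allele count ratio (ACR), the NGS analog of the peak height ratio. The sketch below computes the ACR as the lower read count divided by the higher one and flags a single observed allele whose read count falls below a fixed stochastic threshold. The function names, read counts, and the threshold of 100 reads are hypothetical.

```python
def allele_count_ratio(reads_a: int, reads_b: int) -> float:
    """Lower read count divided by higher read count for a heterozygous pair."""
    high = max(reads_a, reads_b)
    if high <= 0:
        raise ValueError("at least one allele must have reads")
    return min(reads_a, reads_b) / high


def dropout_possible(reads: int, stochastic_threshold: int) -> bool:
    """True if a lone allele is below the threshold, so a sister allele may have dropped out."""
    return reads < stochastic_threshold


print(round(allele_count_ratio(395, 410), 3))          # well-balanced heterozygote
print(round(allele_count_ratio(60, 400), 3))           # imbalanced heterozygote
print(dropout_possible(80, stochastic_threshold=100))  # hypothetical threshold of 100 reads
```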

Subsection 1.9. The estimation of the number of contributors to a DNA profile can be 
done based on the number of alleles present at all loci. If one or two alleles are present in 
each locus, then it is usually assumed that there is only one contributor. If three or more 
alleles are present at one or more loci, then this can be indicative that there are multiple 
contributors. When working with a CE-generated profile, PHRs may also be analyzed 
and compared to the empirically determined heterozygous peak height ratio expectation. 
For NGS-generated profiles, PHRs are not applicable, but information on isoalleles can 
be used, and the number of alleles with different sequences but equal lengths can be 
indicative of the number of contributors. 
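
The allele-counting logic in this subsection can be sketched as follows: the minimum number of contributors is the largest number of distinct alleles at any locus, divided by two and rounded up. Counting alleles by sequence rather than by length can raise that minimum when isoalleles are present. The profile below is hypothetical, including the second 23-repeat sequence and the placeholder locus name.

```python
import math
from typing import Dict, List, Tuple

Allele = Tuple[str, str]  # (length-based designation, sequence string)


def count_alleles(profile: Dict[str, List[Allele]], use_sequence: bool) -> Dict[str, int]:
    """Number of distinct alleles per locus, by sequence or by length only."""
    counts = {}
    for locus, alleles in profile.items():
        if use_sequence:
            distinct = set(alleles)
        else:
            distinct = {length for length, _sequence in alleles}
        counts[locus] = len(distinct)
    return counts


def min_contributors(allele_counts: Dict[str, int]) -> int:
    """Minimum number of contributors: ceil(max alleles at any locus / 2)."""
    most = max(allele_counts.values(), default=0)
    return math.ceil(most / 2)


# Hypothetical profile; "locus_B" and its sequences are placeholders.
profile = {
    "D12S391": [
        ("22", "[AGAT]12 [AGAC]10"),
        ("23", "[AGAT]13 [AGAC]10"),
        ("23", "[AGAT]12 [AGAC]11"),
    ],
    "locus_B": [("10", "seq_x"), ("11", "seq_y")],
}

print(min_contributors(count_alleles(profile, use_sequence=False)))
print(min_contributors(count_alleles(profile, use_sequence=True)))
```

By length, no locus shows more than two alleles, which is consistent with a single contributor. By sequence, D12S391 shows three alleles, indicating at least two contributors.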

Section 2 of the addendum goes over the overview and strategies for mixture inter- 
pretation. The parent document goes over the guidelines for STR mixture interpretation 
based on CE data. If a laboratory is working with length-based NGS data, then the same 
guidelines will apply. The addendum focuses on the differences in interpreting mixtures 
when sequence-based NGS data are used. The following are the main subsections that 
apply to the evaluation of mixture profiles at the sequence level. Some subsections are 
only present in the parent document, and there are no NGS-specific guidelines in the 
addendum. 

Subsection 2.1. All criteria used for mixture interpretation shall be supported by the 
data and shall be defined and documented. This includes all assumptions that are made 
based on information obtained from isoalleles. When working with sequence data, the 
detection of isoalleles can aid in the interpretation of the data and in distinguishing alleles 
from minor and major contributors. 

Subsection 2.2. When dealing with two-person mixtures, pair-wise comparison of all 
potential genotypic combinations should consider sequence data and allele count ratios 
(also known as ACR, which is analogous to CE-based peak height ratio) of all sequence 
combinations, including isoalleles. 

Subsection 2.3. When dealing with greater than two-person mixtures, the interpreta- 
tion of data must be based on results obtained from sequence-based mixture studies, 
including known contributors with isoalleles. 
Subsection 2.4. Again, the presence of isoalleles will help when interpreting a mixed 
sample. When dealing with mixtures with a single major contributor and one or more 
minor contributors, the sequence data may differentiate isoalleles, and this may allow 
the determination of major and minor sequence alleles that are the same length. 

Subsection 2.6. Sequence information can be helpful when interpreting potential 
stutters in a mixed sample because it can allow the differentiation between stutter and 
same-length minor contributor alleles. And once more, validation studies are important 
to provide information on sequence-based stutter expectations. 

Section 3 is the final section of the document and goes over the statistical analysis 
done after the interpretation of the data is finalized. The parent document goes over 
the guidelines for CE-based comparisons of reference profiles to evidence profiles. How- 
ever, if a laboratory is working with length-based NGS data, then the same guidelines 
will apply. The addendum focuses on the differences encountered when comparing pro- 
files at the sequence level. There is only one subsection that presents NGS-specific guide- 
lines. The other subsections are only present in the parent document, and there are no 
NGS-specific guidelines in the addendum. 

Subsection 3.2. Special attention must be paid to the reported range when comparisons 
of reference profiles to evidence profiles are done using sequence data. For this type of 
comparison, sequence-based allele frequency data shall be used, and the sequence range 
of the allele frequency data should be the same as the reported ones for the laboratory 
data. However, if the reported range for the frequency data is larger, then the laboratory 
can first truncate the published frequency data set to match their reported range. If the 
reporting range for the frequency data is smaller, then the laboratory can truncate the re- 
ported range for their allele sequence data to match the frequency data. For statistical cal- 
culations of sequence-based single-source genotypes, isoalleles are considered 
heterozygote genotypes, and the formula is 2pq. However, if laboratories are performing 
length-based analysis, then they should have policies for interpretation and statistical anal- 
ysis when isoalleles are encountered to avoid concealing potentially exculpatory 
information. 

And lastly, there is no current guidance regarding theta values for sequence-based 
data; therefore, the existing NRC II guidance should be followed (NRC II 4.4a, where 
typically θ = 0.01 for most U.S. groups or 0.03 for some isolated populations). 
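
The genotype frequency calculations mentioned in this subsection can be sketched in a few lines. The heterozygote frequency uses the product rule for two different alleles, which also applies to isoalleles treated as a heterozygous genotype. The homozygote form with the theta correction, p² + p(1 − p)θ, is our reading of NRC II formula 4.4a and should be checked against the report before use. The function names and allele frequencies are hypothetical.

```python
def heterozygote_frequency(p: float, q: float) -> float:
    """Expected frequency of a heterozygous genotype with allele frequencies p and q."""
    return 2 * p * q


def homozygote_frequency(p: float, theta: float = 0.01) -> float:
    """Expected frequency of a homozygous genotype with a theta (substructure) correction."""
    return p * p + p * (1 - p) * theta


# Hypothetical sequence-based allele frequencies for two isoalleles of equal length.
print(heterozygote_frequency(0.10, 0.20))
print(homozygote_frequency(0.10, theta=0.01))
print(homozygote_frequency(0.10, theta=0.03))
```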


SWGDAM interpretation guidelines for mitochondrial DNA analysis by 
forensic DNA testing laboratories 

The SWGDAM interpretation guidelines document for mtDNA analysis was revised to 
address NGS, and it was approved on April 23, 2019 (Scientific Working Group for 
DNA Analysis Methods (SWGDAM), 2019b). This document supersedes the 
SWGDAM interpretation guidelines for mtDNA analysis by forensic DNA testing lab- 
oratories, published in 2013. Laboratories should develop and implement interpretation 
guidelines based on validation studies that follow the FBI QAS and the SWGDAM vali- 
dation guidelines. The guidelines document is meant to assist in the interpretation of 
results generated by Sanger or NGS technologies. mtDNA analysis is already sequence-based,
and the use of NGS methods is well suited for this type of analysis. Because the
final data format is very similar between Sanger-generated data and NGS-generated 
data, many sections of the interpretation guidelines document are applicable to both 
methods. For the purpose of this chapter, the focus will be on the subsections with 
NGS-specific guidelines. 

Section 2 focuses on sequence analysis and nomenclature. In subsection 2.2.1, the 
criteria that must be established for the analysis of sequencing results are discussed. Lab- 
oratories should create guidelines to assign nucleotide base calls and to verify the quality 
of results, determining whether they are sufficient for interpretation purposes. All estab- 
lished criteria should be supported by validation studies. For NGS-generated data, indi- 
vidual laboratories should establish criteria for filtering and trimming reads based on 
validation data and/or described in the software literature. Read coverage, Q-scores, 
and strand bias should be considered when determining the quality and suitability of 
the sequence data. Sequencing errors can be biased toward reads of one strand
orientation. Laboratories can determine a strand bias score and filter out bases with
extreme strand bias. Other metrics to consider for the designation of bases include variant 
frequency, count, and quality. 
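
The metrics listed above can be combined into a simple filter. The following Python sketch is illustrative only: the strand bias score is one possible definition (not the metric of any particular analysis software), and every threshold is a placeholder that a laboratory would replace with values supported by its own validation data.

```python
from dataclasses import dataclass


@dataclass
class VariantCall:
    position: int
    forward_reads: int   # variant-supporting reads on the forward strand
    reverse_reads: int   # variant-supporting reads on the reverse strand
    total_coverage: int  # all reads covering the position
    mean_q: float        # mean Phred quality (Q-score) of the variant bases


def strand_bias(call: VariantCall) -> float:
    """Illustrative score: 0.0 = balanced, 1.0 = all reads on one strand."""
    supporting = call.forward_reads + call.reverse_reads
    if supporting == 0:
        return 1.0
    return abs(call.forward_reads - call.reverse_reads) / supporting


def passes_filters(call: VariantCall, min_coverage: int = 100,
                   min_q: float = 30.0, max_bias: float = 0.8,
                   min_fraction: float = 0.10) -> bool:
    """Apply placeholder thresholds for coverage, quality, strand bias,
    and variant frequency."""
    if call.total_coverage < max(min_coverage, 1):
        return False
    supporting = call.forward_reads + call.reverse_reads
    variant_fraction = supporting / call.total_coverage
    return (call.mean_q >= min_q
            and strand_bias(call) <= max_bias
            and variant_fraction >= min_fraction)


balanced = VariantCall(16189, 240, 210, 500, 35.0)
one_sided = VariantCall(152, 95, 5, 500, 36.0)
print(passes_filters(balanced), passes_filters(one_sided))
```

In this example, the second call is rejected because 95 of its 100 supporting reads come from a single strand.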

Subsection 2.2.4 discusses the interpretation of mixtures. It is not common practice to 
interpret Sanger-based mtDNA mixed sequences. Sanger sequencing of mtDNA allows 
the detection of mixtures but does not allow mixture deconvolution. For this reason, 
most forensic laboratories choose not to interpret the mixture results. However, it is a 
different scenario when working with NGS data due to its quantitative nature. When 
using NGS methods, laboratories have access to data on the number of reads, which al- 
lows the quantitative determination of the components in a mixed sample. For a labora- 
tory to perform mtDNA mixture interpretation with NGS, validation studies must be 
conducted following SWGDAM validation guidelines. The limitations of the analysis 
must be fully characterized, and experiments to distinguish among mixture, hetero- 
plasmy, damage, and noise must be performed. Once all studies are done, then the lab- 
oratory can establish a protocol for mtDNA mixture interpretation. 
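
To illustrate how read counts make NGS data quantitative, the sketch below estimates the share of a minor component from positions where two haplotypes differ. The read counts are invented, and the approach is deliberately simplified: it does not, by itself, distinguish a mixture from heteroplasmy, damage, or noise, which is why the validation experiments described above are required.

```python
from statistics import mean


def minor_fractions(read_counts):
    """read_counts: list of (major_reads, minor_reads) pairs, one per
    position at which the two candidate haplotypes differ."""
    return [minor / (major + minor)
            for major, minor in read_counts
            if major + minor > 0]


def estimate_minor_share(read_counts):
    """Average the per-position minor fractions into a single estimate."""
    fractions = minor_fractions(read_counts)
    if not fractions:
        raise ValueError("no informative positions with coverage")
    return mean(fractions)


def spread(read_counts):
    """Range of the per-position fractions. A small spread across many
    positions is more consistent with a second contributor than with
    isolated heteroplasmy or noise (a heuristic assumption)."""
    fractions = minor_fractions(read_counts)
    return max(fractions) - min(fractions)


# Hypothetical read counts at three discriminating positions.
counts = [(900, 100), (850, 150), (880, 120)]
print(f"estimated minor share: {estimate_minor_share(counts):.3f}")
print(f"spread across positions: {spread(counts):.3f}")
```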

Section 2.3 focuses on the guidelines for sequence nomenclature. Independently of 
the method used to obtain mtDNA sequences, Sanger sequencing or NGS, the same 
SWGDAM nomenclature rules should be followed, and the revised Cambridge Refer- 
ence Sequence (rCRS) described by Andrews and coauthors (Andrews et al., 1999) 
should be used as the consensus sequence. 

The following are the SWGDAM nomenclature rules that analysts should follow to 
code variants from the rCRS: 
* Rule 1: Maintain known patterns of polymorphisms (i.e., known phylogenetic
  alignments). Most violations of known patterns of polymorphism involve insertions
  and deletions. A phylogenetic alignment tool is available at empop.org.
* Rule 2: Use nomenclature with the least number of differences unless it violates
  known patterns of polymorphism.
* Rule 3a: Homopolymeric C-stretches in hypervariable region I (HVI): C-stretches
  in HV1 should be interpreted with a 16189C when the otherwise anchored T at
  position 16,189 is not present. Length variations in the short A-tract preceding 16,184
  should be noted as transversions.
* Rule 3b: Homopolymeric C-stretches in hypervariable region II (HVII): C-stretches
  in HV2 should be interpreted with a 310C when the otherwise anchored
  T at position 310 is not present. C-stretches should be interpreted with a 311T
  when the anchored T at position 310 is followed by a second T.
* Rule 4: Maintain the AC repeat motif in the HVII region from np 515–525.
* Rule 5: Prefer substitutions to insertions/deletions (indels).
* Rule 6: Prefer transitions to transversions unless this is in conflict with Rule 1.
* Rule 7: Place indels contiguously when possible.
* Rule 8: Place indels on the 3′ end of the light strand.
* Rule 9: The 3107 nucleotide should not be reported in sample data. As 3107 in the
  rCRS is simply a placeholder intended to maintain historical nomenclature (Andrews
  et al., 1999), differences from the rCRS (i.e., deletions) at this position are not
  biologically meaningful.
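
Rule 6 depends on the distinction between transitions (purine to purine, or pyrimidine to pyrimidine) and transversions (purine to pyrimidine, or the reverse). The short Python sketch below classifies a substitution and writes it in the position-plus-base form used above (for example, 16189C). The function names are hypothetical, and the sketch covers simple substitutions only, not indels or the phylogenetic checks required by Rule 1.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(ref: str, alt: str) -> str:
    """Classify a single-base substitution relative to the reference."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid or ref == alt:
        raise ValueError("expected two different bases from A, C, G, T")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def format_substitution(position: int, alt: str) -> str:
    """Write a substitution as position followed by the observed base."""
    return f"{position}{alt.upper()}"


# A T-to-C change at position 16189 is a transition, written 16189C.
print(format_substitution(16189, "C"), substitution_type("T", "C"))
```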

When working with NGS, as explained in subsection 2.3.4 (Length variants), the 
analysis software used should be taken into consideration when establishing the guide- 
lines for the interpretation of C-stretches. Analysts should do a further review of these 
regions to ensure that variant calling metrics are met because different software programs 
used in NGS workflows may not align homopolymeric (and other) repeat regions uni- 
formly or according to standard forensic practice, as stated in rules 1, 7, and 8 explained 
above. And once more, validation studies are important and should be done to charac- 
terize the performance of the analysis software and establish interpretation guidelines for 
C-stretch regions. 


Peer-reviewed publications on validation studies of NGS systems 


As previously mentioned, peer-reviewed publications of validation studies are not 
required but are encouraged. Many validation studies have been published to this date, 
and they have characterized and shown the limitations of established and NDIS- 
approved NGS workflows, as well as in-house-developed assays and those yet to be 
approved for upload in DNA databases. Here are some examples of published studies
done to validate the first three NGS systems approved for upload to the NDIS.

In 2017, Illumina published the developmental validation of the MiSeq FGx Forensic
Genomics System, comprised of the ForenSeq DNA Signature Prep Kit, MiSeq FGx 
Reagent Kit, MiSeq FGx instrument, and the ForenSeq Universal Analysis Software 
(Jager et al., 2017). The ForenSeq DNA Signature Prep Kit is an assay comprised of 
two primer mixes: DNA Primer Mix A (DPMA) targets Amelogenin, 27 autosomal 
STRs, 24 Y-STRs, 7 X-STRs, and 94 identity-informative SNPs, and DNA Primer 
Mix B (DPMB) targets each of the loci in DPMA, as well as biogeographic ancestry-informative
SNPs and phenotypic-informative SNPs. Users may choose which primer
mix they want to use for analysis. The studies were performed in accordance with the 
2012 revised SWGDAM validation guidelines and encompassed species specificity, sensitivity,
mixed samples, stability (inhibitor, degradation), accuracy, and precision. An important
feature demonstrated in the studies performed was the backward compatibility of
allele calling with current STR databases. Robustness was also assessed through sensitivity 
studies, DNA mixture testing, and stability analysis. Genomic DNA samples were partially 
degraded using mechanical shearing, and the effects of five known PCR inhibitors on 
amplification efficiency were characterized. For the mixture studies, three sets of two- 
person DNA mixtures were prepared in varied ratios. When analyzing data from all 
studies, quality indicators in the ForenSeq Universal Analysis Software assisted in evalu- 
ating all results. Quality indicators are an important feature of the software, and each crime 
laboratory should set thresholds, filters, and other analysis parameters based on data ob- 
tained during their internal validation studies to establish their own internal laboratory 
guidelines and policies. Repeatability and reproducibility studies were also performed 
and indicated 100% accuracy in allele calling when compared to CE data for STRs and 
>99.1% accuracy when compared to bead array typing for SNPs. Limitations were 
encountered when a higher number of high-quality samples were pooled together in a 
single sequencing run, and not all samples yielded full profiles. The authors suggested 
that depending on the level of profile completeness required in certain cases, individual 
laboratories should determine the ideal number of samples in a sequencing run. This 
type of result can be obtained when forensic laboratories perform internal validation 
studies. Other limitations include the imbalance encountered for the autosomal STR locus 
D22S1045; higher relative levels of intralocus imbalance for three SNP loci (rs10776839,
rs7041158, and rs6955448); and higher stutter percentages for the rapidly mutating Y loci.
Once again, individual laboratories should perform internal validation studies and establish 
interpretation guidelines for use in the analysis of their casework samples. Sensitivity
studies showed that the input of DNA ranging from 62.5 pg to 1 ng produced complete, 
accurate, and reproducible genotypes. Overall, the results obtained in this study indicated 
that the MiSeq FGx System is robust, reliable, reproducible, and meets established forensic 
validation guidelines. Since the publication of this developmental validation work, other 
forensic laboratories have performed validation studies in order to develop their own inter- 
pretation guidelines and implement this NGS system in their forensic DNA analysis. 

In 2021, Faccinetto and co-authors (Faccinetto et al., 2021) published internal validation
studies done for the Precision ID mtDNA Whole Genome Panel (Thermo Fisher
Scientific). This work was carried out by the Department of Police Scientific Investigations
(Parma, Italy) in collaboration with multiple universities in Italy. The mtDNA genome
was typed using the Precision ID mtDNA Whole Genome Panel
v.2.2 (Thermo Fisher Scientific) and the Ion S5 system (Thermo Fisher Scientific). All
studies were performed in accordance with the SWGDAM validation guidelines for
forensic DNA analysis methods and the ENFSI Recommended minimum criteria for 
the validation of various aspects of the DNA profiling process. The European guidelines 
were issued in 2010 and do not specifically address NGS workflows (European Network 
of Forensic Science Institutes (ENFSI), 2010). NIST Standard Reference Material 9947A 
and the 2800M Control DNA (Promega Corporation), as well as typical casework specimens,
were used for concordance, repeatability, reproducibility, sensitivity, and heteroplasmy
detection analyses. All results were compared with data generated by Sanger
sequencing and another NGS sequencer in a different laboratory. The results revealed
both strengths and limitations, the latter mostly related to noise thresholds and
heteroplasmy detection. To mitigate such issues, a filtering step for reads was applied to the
analytical workflow to reduce the noise level, and the authors suggested bioinformatic 
solutions using a probabilistic approach based on the analysis of the distribution of variant 
frequencies to allow the estimation of probabilities for each point heteroplasmy (PHP) 
call. Overall, results from the validation studies showed that the Precision ID mtDNA
Whole Genome Panel is accurate, reproducible, reliable, and sensitive, enabling the
generation of useful full mtDNA sequences from challenging samples common in 
forensic casework and proving to be more sensitive than Sanger sequencing. 

An important set of validation studies was published by the FBI Laboratory in
Quantico, VA (Brandhagen et al., 2020). This publication describes the developmental
and internal validation studies done for mtDNA casework. It is important to mention 
that, as explained in the FBI’s QAS, internal validation studies can supplement any vali- 
dation study that is deficient, or, if conducted within the same laboratory, developmental 
validation work may satisfy some of the internal validation studies. The FBI’s work 
focused on testing and implementing the PowerSeq CRM Nested System (Promega, 
WI), an NGS assay that generates information on the mtDNA control region, into the 
FBI Laboratory’s operational casework. This NGS mtDNA system was combined with 
the Illumina MiSeq instrument for sequencing and the GeneMarker HTS software package
(SoftGenetics, State College, PA) for data analysis. All studies were conducted in
accordance with the SWGDAM validation guidelines for forensic DNA analysis methods 
and the FBI’s QAS for forensic DNA testing laboratories and were submitted for NDIS 
approval. As already mentioned in this chapter, as of May 2019, the control region data 
generated using the PowerSeq CRM Nested System is approved for upload to the DNA 
database. The results obtained in this study demonstrated that this mtDNA NGS system is
efficient, accurate, reproducible, sensitive, and reliable. According to the authors, a major 
advantage of this assay encountered during the validation process was its high sensitivity. 
Complete control region sequence data was obtained from as few as 2000 copies of 
mtDNA. To put this into perspective, there are 2–10 copies of mtDNA in one mitochondrion,
and there are up to 1000 mitochondria per cell. Taking into consideration
the mtDNA quantitation values routinely encountered in the FBI Laboratory’s casework, 
the authors anticipate an increase in the number of complete control region haplotypes
generated from questioned samples. It is expected that control region data will be 
retrieved from approximately 90% of routine FBI casework samples, which is much 
more than the 20% with the current Sanger sequencing-based protocols. The studies 
also demonstrated that only 30% of the extract volume is required to generate profiles 
from degraded samples when compared to Sanger sequencing. The authors hope that 
these published studies can provide a model for any forensic laboratory seeking to validate 
NGS protocols for mtDNA typing according to current SWGDAM guidelines. 
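
The sensitivity figures quoted above can be put side by side with a few lines of arithmetic. The Python sketch below uses only the numbers given in the text; because the text states an upper limit of 1000 mitochondria per cell, the per-cell copy numbers are upper bounds, and the "cell equivalents" are therefore rough lower bounds rather than measured values.

```python
# Figures taken from the text above.
COPIES_PER_MITOCHONDRION = (2, 10)         # low and high estimate
MAX_MITOCHONDRIA_PER_CELL = 1000           # "up to 1000"
COPIES_FOR_COMPLETE_CONTROL_REGION = 2000  # lowest input giving full data

# Upper bound on mtDNA copies per cell for the low and high estimates.
max_copies_per_cell = tuple(
    copies * MAX_MITOCHONDRIA_PER_CELL for copies in COPIES_PER_MITOCHONDRION
)

# How many such cells would supply 2000 copies.
cell_equivalents = tuple(
    COPIES_FOR_COMPLETE_CONTROL_REGION / copies for copies in max_copies_per_cell
)

# Expected success rates (percent of routine casework samples).
NGS_SUCCESS_PERCENT = 90
SANGER_SUCCESS_PERCENT = 20
fold_increase = NGS_SUCCESS_PERCENT / SANGER_SUCCESS_PERCENT

print(max_copies_per_cell)  # copies per cell at the stated upper limit
print(cell_equivalents)     # cells needed to reach 2000 copies
print(fold_increase)        # ratio of expected success rates
```

Under these assumptions, 2000 copies correspond to the mtDNA content of roughly one cell or less, and the expected success rate rises 4.5-fold.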


Funding for forensic laboratories to adopt and validate NGS technology


As already mentioned at the beginning of this chapter, the adoption of NGS by forensic 
laboratories has been slow, and the high cost of instrumentation and the validation and 
training processes are often barriers to the implementation of this new technology in 
forensic analysis. A frequently asked question is whether funds are available for
the implementation of NGS. In the United States, there are generally two ways for a
forensic laboratory to get funding. The first is through allocations in federal, state,
or local annual budgets, and the second is through grants.

Federal funding is usually offered by the US Department of Justice, and one of the op- 
portunities that accredited forensic laboratories can apply for is the DNA Capacity 
Enhancement for Backlog Reduction (CEBR) Program. For the fiscal year of 2021, 
more than 80 million dollars were available to fund crime labs and help them reduce 
and/or prevent a backlog of forensic and database DNA samples (Bureau of Justice Assis- 
tance, U.S. Department of Justice, 2021). Through this funding program, laboratories can 
purchase newer and more efficient instruments as well as evaluate, validate, and imple- 
ment new chemistries that are approved for use with NDIS, which now include NGS 
systems. Another funding opportunity is the Paul Coverdell National Forensic Science 
Improvement Act (NFSIA) Program (Bureau of Justice Assistance, US Department of Jus- 
tice, n.d.a), which has part of the funding program allocated to address emerging forensic 
science technology, such as new types of instrumentation. These two programs are known 
as formula grants, and they are awarded based on state crime statistics and/or population. 

Other funding opportunities are also available. These are known as competitive
grants, which are evaluated by an external review panel and are not restricted to
accredited crime laboratories.

A popular grant opportunity is the National Institute of Justice (NIJ) Forensic Science 
Research and Development Program (National Institute of Justice, n.d.). This program 
funds basic or applied research and development projects that include:

* physical capital by supporting the acquisition, maintenance, and development of
  laboratory instrumentation.
* intellectual capital by supporting researchers and providing learning and training
  experiences for scientists at all career stages.
* structural capital by funding projects that support databases and add to the scientific
  literature.

Another competitive grant is the NIJ Postconviction Testing of DNA Evidence Grant 
Program (Bureau of Justice Assistance, US Department of Justice, n.d.b), which provides 
funding to states for postconviction DNA testing for violent felony offenses (as defined by 
state law) in which actual innocence might be demonstrated. This grant cannot be used for 
laboratory equipment but can be allocated to the purchase of laboratory supplies, reagents, 
software programs, and personnel to locate and analyze biological evidence. 

When searching for external funding or in the scenario where funding for validation is 
available in the annual budget, forensic laboratories should take into consideration their 
needs, infrastructure, and personnel availability when deciding which NGS system to 
adopt and, subsequently, when validating it and establishing SOPs and workflows.
Introduction


Forensic genetic genealogy, or investigative genetic genealogy (IGG), emerged as a 
novel technique in 2018 that combines forensic genetics with conventional genealogy 
and genetics (Glynn, 2022). Genetic genealogy has traditionally been used to identify 
family history or to regain the genetic heritage of a person who unfortunately lost their 
biological identity through abandonment, adoption, misattribution of parentage, 
gamete donation, etc. Recently, this technique has found its way into the criminal 
investigation system, where it is being used to find relatives of perpetrators through 
shared DNA. In 2018, IGG was initially used for the identification of Joseph James 
DeAngelo, who was the prime suspect in the famous Golden State Killer case. This
case was highly publicized in print as well as online media. Since then, this technique 
has garnered global attention and is continuously being used in the investigation process 
by forensic and law enforcement professionals. Fig. 26.1 illustrates the concept of
genealogy in familial searching and shows how a suspect can be connected to close
(child-parent; descendants) as well as distant (first cousin and beyond) relatives
through shared DNA (Berkman et al., 2018; Deb, 2022; Greytak et al., 2019; Thomson 
et al., 2020). 

Fig. 26.2 shows the basic overview of IGG from the genealogist’s point of view. In a
nutshell, DNA-containing samples are recovered from the crime scene. Using SNP
profiling, questioned genomic data is generated for the sample. This questioned
genomic data is then run through the genealogical databases. The software compares
it with the profiles present in the database and suggests suitable
matches. These possible matches are then used as investigative leads
to identify the perpetrator.


How forensic DNA profiling differs from forensic genetic genealogy 
In principle, both of these techniques are similar; however, in methodology and
application, they vary significantly. These differences depend on various factors such as the DNA
database searched, the data generated, the genomic region used, the technology used, and
the types and number of DNA markers used (Table 26.1).

Figure 26.1 Pedigree showing relatedness. (Image source: sciencenews.org.)

Figure 26.2 Basic overview of IGG from the genealogist’s point of view (McDermott, 2020).

Table 26.1 Summary of differences for FDP and FGG.

|     | Databases | DNA markers | Region of genome | No. of markers | Latest technology | Data file generated |
| FDP | National DNA databases (US-CODIS core 20 STRs; UK and Ireland employing DNA-17) | | Noncoding | | CE and PCR | Electropherogram |
| FGG | Genetic genealogical databases (FamilyTreeDNA, GEDmatch PRO, DNA Solves, etc.) | | | ~600,000 and ~1 million | Next-generation sequencing and whole genome sequencing; targeted SNP kits | FASTQ |

From Glynn, C. (2022). Genes, 13, 1381. https://doi.org/10.3390/genes13081381.


How investigative genetic genealogy works


To generate the investigative leads, genetic similarities among the known familial
relations are used. The basic information used in the IGG process is divided into two
categories: (i) genetic relative information, which can be generated using a genetic genealogy
database based on its internal comparison of SNP profiles in the database; and (ii)
genealogical and other, frequently publicly available, information from eulogies, census
records, and public databases. The genealogical information is then used to depict
relationships and generate a list of relatives. Genetic relatedness is deduced from DNA
segments that individuals share identical by descent (IBD), that is, segments that each of
them inherited from a common ancestor. Further, this information
is utilized by enforcement agencies to develop the family tree and identify
high-likelihood suspects. The complete process of IGG (steps 1 and 2) that is
bookended by standard police work is illustrated in Fig. 26.3 (Guerrini et al., 2021;
McDermott, 2020).
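
The link between shared IBD segments and relationship can be sketched with the standard expectation that each meiosis halves the amount of DNA passed on. The Python snippet below is a simplified illustration: the relationship table and function names are hypothetical, real sharing varies widely around these expected values, and genealogy databases work with segment lengths rather than a single fraction.

```python
# (meioses separating the two people along one path, number of shared
# common ancestors); values follow standard pedigree reasoning.
RELATIONSHIPS = {
    "parent-child": (1, 1),
    "full siblings": (2, 2),
    "half siblings": (2, 1),
    "first cousins": (4, 2),
    "second cousins": (6, 2),
    "third cousins": (8, 2),
}


def expected_shared_fraction(meioses: int, common_ancestors: int) -> float:
    """Expected fraction of autosomal DNA shared identical by descent."""
    return common_ancestors * 0.5 ** meioses


def rank_relationships(observed_fraction: float):
    """Order candidate relationships by how close their expected sharing
    is to the observed fraction (closest first)."""
    return sorted(
        RELATIONSHIPS,
        key=lambda name: abs(
            expected_shared_fraction(*RELATIONSHIPS[name]) - observed_fraction
        ),
    )


for name, params in RELATIONSHIPS.items():
    print(f"{name:15s} {expected_shared_fraction(*params):.4%}")

# A match sharing about 3% of autosomal DNA is closest to second cousins.
print(rank_relationships(0.03)[0])
```

Note that parent-child and full-sibling pairs have the same expected fraction, so the fraction alone cannot separate them; this is one reason genealogical records are combined with the genetic data.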

IGG is an emerging technology that is most commonly used in the United States (US)
and European countries. The emergence of many direct-to-consumer (DTC) databases
has made the required genomic information easily accessible to investigative agencies,
which has also added to its overwhelming use in the investigation process. Countries such
as the UK, Sweden, Australia, England, the Netherlands, Ireland, and Germany are also
exploring its potential use to identify unidentified human remains and solve violent
crimes (Mehar, 2021; Samuel & Kennett, 2020; Scudder et al., 2020; Thomson et al.,
2020; Tillmar et al., 2021).

Figure 26.3 Basic steps involved in investigative genetic genealogy (IGG). The flowchart
reads as follows:

STEP 1: An STR profile is generated from the biological sample. If there are no suspects,
the STR profile is uploaded to CODIS to identify a possible match.
* If match confirmed: the name of the matching offender in CODIS is released to law
  enforcement as an investigative lead.
* If match not confirmed: leads are identified and pursued through the normal channels
  (interviews; collection of biological samples from one or more high-likelihood suspects;
  analysis of evidence; surveillance of suspects and their associates; generation of STR
  profiles from such samples; and direct comparison of those profiles with the forensic
  STR profile).
* If failed or unsuccessful: IGG is used to develop additional leads.

STEP 2: An IGG SNP profile is generated from the forensic sample and uploaded to genetic
genealogy databases (FamilyTreeDNA (Houston, TX, USA), GEDmatch PRO, and DNA
Solves), which generates a list of genetic relatives and the basis for their identification.

STEP 3: A family tree is developed (*targeted testing).

STEP 4: Individuals within the tree who are high-likelihood suspects are identified.

*Targeted testing: In a few cases, DNA samples might be collected from other family
members for SNP testing, and a comparison can then be done to narrow down the search to
specific branches of the larger tree.

Genetic genealogical databases such as Oxford Ancestors (Bicester, UK; year 2000),
FamilyTreeDNA (Houston, TX, USA; year 2000; www.familytreedna.com), GEDmatch PRO
(2010; www.gedmatch.com), DNA Solves, 23andMe (South San Francisco,
CA, USA; year 2007; www.23andme.com), MyHeritage (Or Yehuda, Israel; year 2016;
www.myheritage.com/dna), AncestryDNA (Lehi, UT, USA; year 2012), and
Ancestry.com (1996; www.ancestry.com/dna) gather genomic data, and some of these
allow the information to be used for investigative purposes. As of July 2022, the top
genealogical databases have 41 million registered users, out of which ~21 million are
in AncestryDNA, ~12.8 million in 23andMe, ~6 million in MyHeritageDNA, and
~1.77 million in FamilyTreeDNA (Glynn, 2022; Mehar, 2021). This number continues
to grow rapidly as IGG gains more acclaim after solving a series of famous cases (Greytak 
et al., 2019). This has also sparked an intense debate on the challenges and ethical con- 
siderations related to the extensive use of IGG. Although IGG has provided an efficient 
tool for solving new and cold cases, it has also raised many ethical and social concerns that 
need to be addressed. It is imperative that a solution be found to these challenges to make 
the use of IGG more transparent, ethical, and socially acceptable. 


Challenges in using genetic genealogy 
Errors 


SNP genotypes can experience two types of errors: (a) technological errors and
(b) induced errors (Fig. 26.4). Technological errors are those arising during the process
of amplification and sequencing or during the bioinformatics pipeline, which includes
sequence alignment or variant calling (Kling et al., 2021). Once the sequencing has
been completed, a large amount of data is generated, which must be aligned
with the reference data before comparison. This is challenging as the read lengths are
short, ranging from 36 to 250 bp. The presence of multiple similar sites, repetitive zones,
and pseudogenes, together with the massive volume of data generated, can make this
process tedious and challenging (Duarte Castelao, 2017). Studies report that genotyping
errors can be less than 1%, whereas phasing errors can be around 0.64% (Browning &
Browning, 2007; Durand et al., 2014). These errors can significantly affect the likelihood
ratio calculations, by up to 0.05% (Kling, 2019). One method to reduce these types of errors
is to use a haplotype score incorporating the phase and genotype error rates, which helps
in filtering the spurious IBD segments during the postprocessing steps (Durand et al.,
2014). Some other methods to reduce these errors include incorporating the error rates
into the segment models (Browning & Browning, 2013; Dimitromanolakis et al., 2019) or
modeling for the errors themselves (Ball et al., 2016; Henn et al., 2012).

Figure 26.4 Illustration of two types of errors. An IBD segment is prematurely terminated at the
dashed red line with errors highlighted in red. (A) Phase (switch) errors occur when the
chromosomes of an individual are separated into maternal and paternal origins and where the
process of phasing (wrongly) switches chromosomes. (B) Genotyping errors occur either during the
amplification process or at the bioinformatic genotyping level. (From Edge, M., & Coop, G. (2020).
eLife, 9. https://doi.org/10.7554/eLife.51810.)

Samples available for database sequencing are of better quality and quantity; therefore,
a complete genome is available for exploration (high-depth genome sequencing).
Forensic samples are often of inferior quality and quantity, so low-depth genome
sequencing is deployed. Due to this, higher error rates are observed when the questioned
sample is searched in the database (Saunders et al., 2007).
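
The effect of a genotyping error on segment detection can be sketched as follows. This is a simplified illustration, not the haplotype score of Durand et al. or the method of any database: it treats unphased genotypes as counts of the alternate allele (0, 1, or 2), regards a site as incompatible with sharing when the two people are opposite homozygotes (0 versus 2), and optionally tolerates a fixed number of such sites per segment. The genotype vectors and thresholds are invented.

```python
def candidate_segments(g1, g2, min_snps=5, max_errors=0):
    """Return (start, end) SNP index pairs (inclusive) for runs in which
    the two genotype vectors are compatible with sharing one allele,
    tolerating up to max_errors opposite-homozygote sites per run.
    With max_errors > 0, neighbouring runs may overlap."""
    if len(g1) != len(g2):
        raise ValueError("genotype vectors must have the same length")
    segments = []
    start = 0
    tolerated = []  # indices of opposite-homozygote sites in the current run
    for i, (a, b) in enumerate(zip(g1, g2)):
        if {a, b} == {0, 2}:
            tolerated.append(i)
            if len(tolerated) > max_errors:
                # Close the current run just before this site.
                if i - start >= min_snps:
                    segments.append((start, i - 1))
                # Restart after the oldest tolerated mismatch.
                start = tolerated.pop(0) + 1
    if len(g1) - start >= min_snps:
        segments.append((start, len(g1) - 1))
    return segments


# Invented genotypes; index 6 is an opposite homozygote (0 versus 2),
# as a single genotyping error could produce.
person_1 = [0, 1, 2, 2, 1, 0, 0, 1, 2, 1, 0, 2]
person_2 = [1, 1, 2, 1, 1, 0, 2, 1, 2, 2, 0, 1]

print(candidate_segments(person_1, person_2, max_errors=0))
print(candidate_segments(person_1, person_2, max_errors=1))
```

Without error tolerance the single discordant site splits one long run into two shorter ones, which mirrors the premature termination shown in Fig. 26.4; allowing one error recovers the full run.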


DNA mixtures 

Databases are built using single-source DNA; however, DNA from multiple sources is 
common in forensic cases. There are well-established protocols for the examination of 
such samples and for calculating their evidentiary values (Bleka et al., 2016; Bright et 
al., 2016; Haned, 2011). However, extending the examination to genealogy might be 
challenging. Searches must be conducted using a single-source DNA profile (Phillips, 
2018). This can be accomplished by separating the profiles of contributors either through 
conditioning or by developing a statistical model for it. However, in doing so, a level of 
uncertainty is generated, which can then affect the false/true positive rates (Bradford
et al., 2011). Low-quality DNA from forensic cases yields very little information, even
when whole genome sequencing is employed; however, a statistical model could extract
contributors from the mixture based on the allele dosage, also known as read counts.
Where the mixture shows a homozygous genotype at a marker, the perpetrator must
also be homozygous. However, where the mixture shows a heterozygous genotype, the
perpetrator can be homozygous or heterozygous for the allele. This leads to a rise in the
number of false positives (Kling et al., 2021).
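
The ambiguity described above can be made concrete with a short Python sketch. It lists the genotypes a single contributor could carry given the alleles seen in a mixture at one SNP, and shows how read counts (allele dosage) could help under a simple two-person model. The function names, the two-person assumption, and the 80:20 ratio are illustrative only.

```python
from itertools import combinations_with_replacement


def possible_contributor_genotypes(observed_alleles):
    """All genotypes one contributor could have, given the alleles
    observed in the mixture at a single SNP."""
    alleles = sorted(set(observed_alleles))
    return list(combinations_with_replacement(alleles, 2))


def expected_allele_fraction(dose_major, dose_minor, major_share):
    """Expected read fraction of an allele in a two-person mixture.
    dose_* is the number of copies (0, 1, or 2) each person carries;
    major_share is the major contributor's share of the DNA."""
    return major_share * dose_major / 2 + (1 - major_share) * dose_minor / 2


# Only one allele seen: the contributor must be homozygous.
print(possible_contributor_genotypes(["A"]))

# Two alleles seen: three genotypes remain possible.
print(possible_contributor_genotypes(["A", "G"]))

# If the major contributor (80%) is A/A and the minor is A/G,
# the G allele is expected in about 10% of reads.
print(round(expected_allele_fraction(0, 1, 0.8), 3))
```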


Degraded or low-quality forensic samples 

In next-generation sequencing, the quality of extracted DNA depends on yield and 
degradation index (DI). Good-quality DNA has an A260/A280 ratio of 1.7–1.8.
Lower-ratio DNA might be suitable for other applications, but only after some pretreat- 
ment to remove the contaminants. A higher yield indicates good quality, but this is not 
always the case. Sometimes, despite a high yield, the quality of the extracted DNA might 
be poor. This is when the extracted DNA sample contains a large quantity of junk DNA 
(Davawala et al., 2022). Another factor to consider here is the DI. The DI is the ratio of 
smaller to larger DNA fragments in a given sample. A low-yield DNA sample with a low 
DI is better than a high-yield DNA sample with a high DI. 
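
The two quality measures above reduce to simple ratios, sketched below in Python. The DI is computed here as the concentration of a small target divided by that of a large target, which is an assumption about how the "smaller to larger fragments" ratio is measured; the sample values are invented, and the 1.7–1.8 purity window is the one quoted in the text.

```python
def degradation_index(small_target_conc: float, large_target_conc: float) -> float:
    """DI as the ratio of small-fragment to large-fragment concentration.
    Values near 1 suggest intact DNA; larger values suggest degradation."""
    if large_target_conc <= 0:
        raise ValueError("large-target concentration must be positive")
    return small_target_conc / large_target_conc


def purity_ok(a260: float, a280: float, low: float = 1.7, high: float = 1.8) -> bool:
    """Check whether the A260/A280 ratio falls in the stated window."""
    if a280 <= 0:
        raise ValueError("A280 must be positive")
    return low <= a260 / a280 <= high


# Hypothetical samples (concentrations in ng/uL).
low_yield_intact = degradation_index(0.05, 0.04)    # low yield, low DI
high_yield_degraded = degradation_index(1.0, 0.25)  # high yield, high DI

print(f"low-yield sample DI:  {low_yield_intact:.2f}")
print(f"high-yield sample DI: {high_yield_degraded:.2f}")
print(purity_ok(0.875, 0.5))
```

In this invented comparison, the low-yield sample has the lower DI and would be the better candidate for sequencing despite containing less DNA.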


Need for standards 

There is a consensus among the IGG practitioners that there is a lack of official standards 
and certifications. Gurney et al. reported the formation of a nonprofit Board of Certifi- 
cation for IGG, comprising the most renowned IGG practitioners in the USA, to develop 
IGG standards (Gurney et al., 2022). Some other available guidelines include the Interim 
Policy on Forensic Genetic Genealogical DNA Analysis and Searching (released in 
November 2019 by the US Department of Justice) (Callaghan, 2019) and the overview 
of IGG (released in February 2020 by the Scientific Working Group on DNA Analysis 
Methods, SWGDAM) (Epstein et al., 2000). Still, there is an immense need to develop 
standardized methods, policies, and protocols for the application of genetic genealogy in 
the field of criminal investigations. 

IGG has developed as a specialization of genetic genealogy where the methodologies 
are similar. However, their impact on people might be significantly different. An individ- 
ual who seeks to find out his/her relatives might give their consent; however, the same 
person might retract his/her consent if the same information was to be used in a criminal 
investigation (Gurney et al., 2022). Standardizing the techniques, methodologies, and
their applications builds public trust that they will not be misused. This further helps in
reducing the public’s fear while consenting to the use of their genealogy in a criminal
investigation. The public also fears that IGG practitioners might be ignoring the terms of
service. Standards also allay these concerns. The words of Ram et al. (2021), that “law enforcement
privacy advocates found value in developing guardrails [for IGG]”, the latter because
“limits on the technology (of IGG) to promote public trust would help ensure that
the technology remains available to assist them in their work,” seem to summarize this
efficiently.

As the public must have trust in IGG practitioners, the investigating agencies must also 
have trust in the IGG practitioner’s qualifications and ability to perform the assiduous 
tasks of IGG. Standardization could achieve this by controlling the proficiency and 
ethical level delivered by the IGG practitioners. Proficient practitioners can provide value 
for the money, time, and other resources spent on their services, which otherwise go to 
waste. Also, subpar and unethical performances can jeopardize the investigation, in which 
case agencies might totally abandon its use despite its huge potential to provide important 
leads in criminal investigations. To maintain professional proficiency, accreditation stan- 
dards can be used. Standards can also help in setting up accountability where proper pro- 
tocols and standards are not followed. For these reasons, standards can help build the trust 
of the public and the investigating agencies. 


Direct-to-consumer genomic services 


In 2021, the direct-to-consumer (DTC) genomic industry was valued at around USD 1.29 billion 
(Direct-To-Consumer Genetic Testing Market, 2022); by 2022, its market value had more than 
doubled to USD 3 billion. It is projected to grow at an 11.5% compound annual 
growth rate from 2023 to 2032 and reach a market value higher than USD 10.5 billion 
(Swain & Subodh, 2023). This ballooning growth is due to the decreasing cost of 
large-scale SNP typing, rising public awareness about DTC services, a rising incidence of 
genetic disorders, increased interest in finding relatives and ancestors, and growing demand 
for such services (Court, 2018; Naveed et al., 2015). Most of these services (around 77%) 
are accessed through online platforms, whereas the remaining 23% are over-the-counter 
services. North American and European countries continue to be the major 
consumers of DTC genomic services. In the forensic context, there are several issues 
pertaining to DTC genomic services. These include false identification, the use of unproven 
or invalid testing methods, and biospecimen or hacking threats (Apathy et al., 2018; 
Carroll et al., 2020; Salloum et al., 2018). 


False identification 

Using DTC genomic services to find relatives can be challenging and misleading, espe- 
cially as you move far back or far across the pedigree. The severity of the problem is 
aggravated because a person could have 1000 fourth cousins and around 5000 fifth 
cousins, out of which only a few could be included in the database. The IGG practi- 
tioners use a triangulation method for identifying the parts of shared genetic sequences 
between the perpetrator and the database; however, using DNA as the sole source might 
be less fruitful at best and erroneous at worst. Therefore, it is advisable to augment the 
results obtained from the DTC genomic search using conventional genealogical methods 
(Court, 2018). 


Use of unproven or invalid testing methods 

There is a severe lack of genealogical standards in the market, and it is highly pertinent 
that the use of genealogical services be regulated. In the absence of such standards, it is 
very difficult to validate the results provided by these DTC genomic services. Many 
studies have raised serious concerns about the validity of the methods used by these 
DTC services (Court, 2018; Naveed et al., 2015). 


Biospecimen and hacking threats 

DNA is extracted from the cells in its chemical form and is then digitized using computing 
systems (Naveed et al., 2015). Sensitive information pertaining to the genotypic 
and phenotypic makeup is stored in the databases. Owing to their digital nature, these 
databases are susceptible to cyberattacks and information loss. Similar attacks have 
been reported on fingerprint repositories (Aldhous, 2020). In July 2020, genomic 
data of 1 million users was leaked from two online databases, GEDmatch and MyHeritage, 
using a phishing email that redirected users to a fake login page and then collected their 
usernames and passwords. Any attack on a similar database can severely diminish its 
credibility and stigmatize the usage of genomic databases among people as well as 
investigative agencies. 


Intrusion into privacy 


One of the most important privacy-related questions raised by Wickenheiser is: To what 
extent can the rights of the innocent public and relatives of the perpetrator of a crime be 
infringed upon by interrogating their genetic data to identify the crime perpetrator, pre- 
vent future crimes, and improve public safety? Using genealogical surveillance can help in 
apprehending criminals even where no hit is found in the criminal DNA databases. How- 
ever, it raises a serious concern about the individual’s freedom, right to privacy, and au- 
tonomy even when their DNA is not found at the crime scene. Practitioners, law 
enforcement agencies, and police personnel access personal information, including age, 
sex, and location, about both living and dead individuals (Gurney et al., 2022). 


Genomic privacy 

Genomic data is not exceptional; however, when interpreted properly, it can provide 
sensitive information about any individual. This data-centric information has widespread 
applications, and its leakage can have immense implications in terms of discrimination in 
insurance, employment, education, etc. One such example was shared by Dr. Noralane 
Lindor at the Mayo Clinic’s Individualizing Medicine Conference (Lindor et al., 2013). 
She states that while studying cancer patients, she also sequenced the grandchildren of 
one of her patients and found that two of the grandchildren had the same mutation as the 
patient. When one grandchild applied to the United States Army for the position of 
helicopter pilot, she was rejected even though the position does not require any genetic 
testing. Many other examples are now available where family members 
of cancer patients are discriminated against in society. One such example is that of the 
family members of Henrietta Lacks (Skloot, 2017). Here, the medical researchers 
uploaded their genotypic information online, and when a complaint was filed against 
this, the information was taken down by the researchers. However, by the time it was 
taken down, the data had already been downloaded and was subsequently published 
either in parts or completely. In some other cases, the genomic information was uploaded 
without getting permission from the relatives of the patient (Nyholt et al., 2009; Rutter, 
2013). 

This discrimination can be more severe from a forensic perspective. Relatives of crim- 
inals can face extreme discrimination because of the seriousness of the matter. The racial 
imbalance in NDIS/CODIS data is quite clear. Black people make up around 13% of the 
United States population; however, they contribute around 40% of database entries. 
Similarly, Latinxs make up 22% of the United States population but contribute 22% to 
database entries. This situation becomes more severe when one includes data related to 
immigrant detainees (Wetsman, 2020). On examination of online genomic databases, 
one would find that the records are predominantly from the white or Caucasian popu- 
lation, which makes up 75% of total database entries. This percentage is projected to rise 
to 90% in the coming years (Kaiser, 2018). This racial disparity in DNA databases can be 
due to two factors: (a) Disparity in people who get arrested, and (b) disparity in resources 
to get out of it. Even in countries such as the United States, the financial divide between 
the white population and the population of other races is significant, so the white pop- 
ulation is more likely to go through DNA sequencing to store their genomic data in a 
database (Klugman & Rodriguez, 2021). Based on this, crimes related to the white pop- 
ulation have a significantly higher chance of getting solved using these genomic 
databases. 

There is another facet to this problem. NDIS/CODIS was developed to store the data 
of criminals for their future identification, and this was acceptable. Eventually, many or- 
ganizations uploaded the relevant data even when the person was arrested and later let go, 
or even when the person was arrested for minor crimes. Once the person’s data is 
uploaded to the database, he/she is presumed to be guilty of the crime without proof 
in a court of law. Extending it to the relatives of the perpetrator might be a breach of 
the relative’s privacy. In Maryland v. King, the US Supreme Court held that by commit- 
ting the crime, the criminal surrenders their right to privacy (Field et al., 2017). The 
important question here is whether this also applies to the relatives of the criminal 
(Haimes, 2006). Do the relatives of the criminal give up their right to privacy just by 
being related to the criminal? Should the families of criminals be subjected to lifelong 
surveillance and indirect lifelong surveillance? If not, then why should they be involved 
in criminal investigations? The family should not be subjected to any such discrimination, 
as every person is equal before the law, irrespective of their relationship with the criminal, 
and must be innocent until proven guilty. Genomic data of a person can reveal their 
phenotypic information, which can lead to additional discrimination based on race, 
gender, or any other medical or criminal features (Klugman & Rodriguez, 2021). The 
use of genomic data might be considered easy and convenient; however, considering 
the discriminatory possibility, it must be used with caution. 


Bypassing codes of informed consent 

The process of obtaining consent is a key element of medical and biomedical research. It 
makes the donor autonomous in choosing whether or not to donate their genomic data 
by informing them of the risks and benefits associated with the application (Hoeyer & 
Lynöe, 2006; Tindana et al., 2019). This autonomous choice has three features: (a) 
adequate information; (b) voluntary, uncoerced choice; and (c) rationality in choice. 
In the case of IGG databases, the information regarding consent should be explained to 
donors in simple, clear, and complete words (Wickenheiser, 2019). 

A biomedical biobank is developed to fight critical diseases and supplement medical 
knowledge. In contrast, the aim of using genomic databases for criminal investigations is 
to identify the perpetrators using pedigree or familial relationships (Machado & Silva, 
2015). Therefore, consent in both settings should be explained differently. For example, 
a person might readily give his/her consent for medical research or to identify his/her 
relatives. The same person might not be willing to give consent if the information is 
to be used for a criminal investigation. Therefore, it is important to plan guidelines 
that are specific to the use of genomic data in criminal investigations. The process of 
consent, if efficient, can turn the risks of IGG into its strengths (Court, 2018). If neglected, it 
can severely affect the credibility of the IGG databases. 

In a 2007 Nuffield Council study, it was observed that over 40% of donors gave un- 
limited consent for the usage of their genomic data for research purposes, which was later 
monetized, and then the consent could not be withdrawn (Court, 2018; Parry, 2008). 
There are some serious concerns associated with the consent information and how it is 
delivered to the donor. It is important that the donor is informed correctly of his rights, 
risks, and duties and that he understands this information. Many companies only inform 
donors that their genomic data will be used for medical or research purposes and not for 
investigation (Berkman et al., 2018). Some companies, such as GEDmatch and Family- 
TreeDNA, allow the use of databases for investigative purposes. They only ask for broad 
consent rather than specific case-to-case consent, which is important in such cases. The 
settings of the consent can be changed at later stages as well (Ram & Roberts, 2019; Rob- 
erts & Hawkins, 2020). Other companies use undecipherable or incomplete information 
that the donor does not read or understand (Radin, 2012). 

How the consent guidelines should be written is yet to be resolved. Some studies sug- 
gest that people know about the importance of consent while collecting the samples; 
however, they are divided on how specific the consent must be (Grady et al., 2015; Wen- 
dler, 2006). Some individuals believe that broad consent is acceptable, while others 
emphasize that more information must be provided to make an educated decision, and 
some others emphasize that the information provided should be in such a way that it 
is easily understandable (Madrigal, 2012; Samuel & Kennett, 2021). 


Genetic surveillance 

One of the many ways of misusing the genetic information available in IGG databases is 
to use genomic data for genetic surveillance (Zaretsky, 2021). Genetic surveillance in- 
cludes tracing the individual and his family using their genomic information. Databases 
often have an overabundance of data from specific populations, and therefore, this infor- 
mation can be used to find individuals of a specific race, ethnicity, caste, or religion 
(Rosen, 2009). Professor Mnookin has called the process of familial searching 
discriminatory and adds that this discrimination is not limited to race or class but extends to 
good relatives because of some bad ones (Epstein, 2009). 


Via genealogical databases 

Users upload their genotypes to online direct-to-consumer databases to 
search for their long-lost relatives. Once they upload their data, the database searches 
it to generate a report. This report may contain substantial information about one’s 
putative relatives (Edge & Coop, 2020). Identical-by-state (IBS) segments, that is, genomic 
runs of (near) identical sequence shared among individuals, can be thought of as a superset 
of true identical-by-descent (IBD) segments. Edge and Coop discuss the methods by which the 
genetic privacy of the user can be compromised. They describe three methods for revealing 
the genotype data held in online databases to which one can upload one’s own genotype 
(Edge & Coop, 2020). These methods are: (1) IBS tiling; (2) IBS probing; and (3) IBS 
baiting. 

In IBS tiling, the genotype information shared between a target user in the database 
and each member of a set of comparison genomes is aggregated into potentially 
substantial information about the target’s genotype (Fig. 26.5). For example, a person 
of Caucasian origin is likely to have some genetic similarity with other people of the 
same origin across the world. By comparing the target user’s genotype with that of a 
random person of the same descent, one can gather some information. If we extend 
this method to more individuals of the same origin, we can gather even more information 
about the target individual. By combining all the gathered information, one can 
uncover much of the target’s genome through the process of “tiling” of shared IBS 
segments with known genotypes. Compared to IBS segments, true IBD segments can reveal 
more information about shared genotypes. This is because of the higher probability 
of untyped variants being shared among individuals of similar origin. Distinguishing 
IBS1, where sharing covers only one chromosome, from IBS2, where it covers both 
chromosomes, is difficult in the absence of haplotype information. The performance of 
IBS tiling also varies considerably among populations, so tiling rates depend on the 
population under study. Genomic regions with low SNP density and low 
heterozygosity are prone to biases. This means that the recovery of minor alleles using 
IBS tiling is slightly lower in comparison to the genomic population (Edge & Coop, 
2020). 
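
The aggregation step in IBS tiling can be pictured as taking the union of the segments that the target shares with each comparison genome. The sketch below is a toy illustration under stated assumptions: segments are given as hypothetical (start, end) coordinates on a single chromosome, the dictionary of comparison genomes is invented, and only the covered fraction is computed. It does not reproduce the procedure or the results of Edge and Coop (2020).

```python
def merge_intervals(segments):
    """Merge overlapping (start, end) intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous tile: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def tiled_fraction(segments_by_comparison, chromosome_length):
    """Fraction of the chromosome covered by at least one IBS segment."""
    all_segments = [
        segment
        for segments in segments_by_comparison.values()
        for segment in segments
    ]
    covered = sum(end - start for start, end in merge_intervals(all_segments))
    return covered / chromosome_length


# Hypothetical IBS segments reported between the target and three uploads.
ibs_segments = {
    "comparison_1": [(0, 30), (50, 60)],
    "comparison_2": [(20, 45)],
    "comparison_3": [(58, 80)],
}
print(tiled_fraction(ibs_segments, chromosome_length=100))
```

Each additional comparison genome can only add to the union, which is why uploading more genomes of similar origin reveals progressively more of the target.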

IBS probing is used to identify people of a specific genotype (Fig. 26.5). For example, 
one might want to identify people carrying the risk alleles of, say, Parkinson’s disease, even 
when there is no record of this information. All one has to do is prepare a haplotype with 
the specific genotype and upload it to the public database. The database will search against 
it and return all the people who match this specific allele. For this, a chromosome that 
is unlikely to have long shared segments is prepared so that it is free of linkage 
disequilibrium. Long matching segments represent close relatives, and therefore, their presence 
reduces the sensitivity of the IBS probing method. The success of the probing method 
also depends on the frequency of the allele of interest in the given population. The rarer 
the allele of interest, the more likely the results are to be significant. If the allele of interest 
is common, there is a higher chance that carriers will share the same long haplotype around 
the site of interest, and thus fewer probes will be required to find out the identities of the 
carriers in the given population (Edge & Coop, 2020). 

IBS baiting exploits incompatible homozygous sites, that is, sites at which one 
individual is homozygous for one allele and another individual is homozygous for the 
other allele. In this method, artificial IBS segments are constructed from runs of 
heterozygous sites. Using such a constructed IBS segment, the genotype of the target at a 
site of interest is revealed. IBS baiting is of two types: (a) single-site IBS baiting and 
(b) parallel IBS baiting. In the case of single-site IBS baiting, the genotype at a single 
site is revealed, whereas in the case of parallel IBS baiting the genotypes at multiple 
sites are revealed simultaneously (Fig. 26.6) (Edge & 
Coop, 2020). 
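
The decision rule for single-site IBS baiting, as described in the caption of Fig. 26.6, can be written out in a few lines. This is a minimal sketch under simplifying assumptions: the site is biallelic, two genotypes count as IBS at a site when they share at least one allele, and the flanking heterozygous runs are not modeled. The function names and the alleles "A" and "G" are hypothetical. In a real attack the target genotype is hidden, and only the two IBS observations are visible to the uploader.

```python
def is_ibs(genotype_a, genotype_b):
    """Two genotypes are IBS at a site if they share at least one allele."""
    return bool(set(genotype_a) & set(genotype_b))


def infer_target_genotype(ibs_with_bait_1, ibs_with_bait_2, allele_1, allele_2):
    """Infer the target genotype from whether IBS is unbroken with each bait.

    Bait 1 is homozygous for allele_1, bait 2 is homozygous for allele_2.
    """
    if ibs_with_bait_1 and ibs_with_bait_2:
        # No break with either upload: the target carries both alleles.
        return (allele_1, allele_2)
    if not ibs_with_bait_1 and ibs_with_bait_2:
        # Break with bait 1: homozygous for the allele bait 1 does not carry.
        return (allele_2, allele_2)
    if ibs_with_bait_1 and not ibs_with_bait_2:
        return (allele_1, allele_1)
    raise ValueError("break with both baits: site is not biallelic for these alleles")


def bait_key_site(hidden_target_genotype, allele_1="A", allele_2="G"):
    """Simulate the database comparison, then apply the decision rule."""
    bait_1 = (allele_1, allele_1)
    bait_2 = (allele_2, allele_2)
    return infer_target_genotype(
        is_ibs(hidden_target_genotype, bait_1),
        is_ibs(hidden_target_genotype, bait_2),
        allele_1,
        allele_2,
    )


for hidden in [("A", "A"), ("A", "G"), ("G", "G")]:
    print(hidden, "->", bait_key_site(hidden))
```

Parallel baiting applies the same rule at many widely spaced key sites within the same pair of uploads.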


Figure 26.6 Schematics of the IBS baiting procedure. (A) To perform IBS baiting at a single site, two 
uploads are required, each with runs of heterozygous genotypes flanking the key site. At the key site, 
the two uploaded datasets are homozygous for different alleles. The three possible target genotypes 
at the key site can each be determined by examining their IBS coverage with the uploads. If there is a 
break in IBS with either upload, then the target is homozygous for the allele not carried by the upload 
that shows the break in IBS (with the broken IBS segment shown as a cyan line). If there is no break in 
IBS with either upload, then the target is heterozygous at the key site. (B) Target genotypes at many 
key sites across the genome can be learned by comparison with two uploaded datasets, as long as key 
sites are spaced widely enough. 


Ethical issues 

Forced collection of DNA samples from arrested individuals 

It is common practice to collect an individual’s DNA sample on arrest as a procedure, 
even when the crime has not been proven in a court of law. It is considered that the 
accused has forsaken his right to privacy. However, jumping to this conclusion can be 
very problematic. First, taking and analyzing DNA samples is part of a search, which 
must not be started without probable cause and judicial authorization. Second, this 
mandatory DNA sample collection must not be considered a part of a special need, 
especially when the guilt has not been proven. On these two grounds, the forced 
collection of DNA samples should be regarded as unconstitutional (Rosen, 2009). DNA 
sampling must be considered a search because it involves public exposure, bodily 
intrusion, and the extraction of information. Even if the criminal has forsaken their 
right to privacy, no consent has been given by the relatives of the perpetrator to 
use their genomic data to incriminate their relatives (Court, 2018). This applies to 
all the databases storing genomic data, whether for medical research or for 
investigative purposes. 


Lack of a law protecting privacy 

Every country has some fundamental rights and laws protecting privacy, but laws protect- 
ing genetic privacy are few and rare. The debate on genetic privacy is ongoing across 
countries. Many states in the United States have implemented laws protecting genetic 
privacy, and some others are in the process of formulating such laws. Some European 
countries have also come forward to develop such laws. Despite such steps, there is a 
dire need for uniform laws against genetic privacy breaches (Hudson et al., 2008; 
Norrgard, 2008). Many laws, such as the Fourth Amendment of the United States 
Constitution and Articles 20 and 21 of the Indian Constitution, provide protection against 
unauthorized searches, seizures, and sample collection (Baird & Avots, 2008; Khosla, 
2012). This also includes the rights of the dead, which are often termed postmortem pri- 
vacy. What if the deceased did not desire their genomic data to be used in the database? In 
such cases, the genomic data must be used with caution. 


Case studies 

Phoenix canal killings 

The canal killings in the Phoenix area are an important case that emphasizes the use of GG 
to solve a mystery. In the early 1990s, two women in the Phoenix area were murdered. Bryan 
Patrick Miller was linked using DNA findings to the deaths of Angela Brosso (November 
1992) and Melanie Bernas (September 1993). The case was solved in 2015 and was the 
first ever to be solved using genetic genealogy. 


Golden State serial killer or case of Joseph DeAngelo 

In 2018, IGG was successfully used in the investigation of the Golden State serial killer 
case to identify the East Area Rapist, Joseph James DeAngelo, as the perpetrator of 45 
rapes/sexual assaults and 13 murders in California in the 1970s and 1980s. Investigators 
collected left-over cans as evidence from outside of DeAngelo’s house, and eventually 
a piece of tissue from the trash was used to get the arrest warrant against DeAngelo. 
Investigators had uploaded the DNA profile of the suspect collected from the crime scene 
of a murder in 1980 to the GEDmatch genealogical database and traced out 
the family tree of the suspect based on close genetic matches (Mehar, 2021; Sarbak 
& Won, 2023). 


Way forward 


The use of genetic genealogy is only two decades old, as its first usage was reported in 
the year 2000. Its prime use was in the field of medical research, where it was being 
used to track the genotype sequences that were responsible for serious medical 
conditions such as cancer. Starting with the Golden State killer case, the use of genetic 
genealogy in criminal investigations gathered pace. Despite many analytical issues 
(errors), ethical issues, privacy issues, and other challenges, its use is increasing 
exponentially, as is evident from the ballooning growth of the genetic genealogy market. 
Table 26.2 enumerates key challenges and possible solutions that can make the 
IGG practice more reliable, credible, and authentic. The first requirement is to set 
up guidelines for the use of IGG and protocols for the methods, techniques, and 
application of IGG. The second is to explain the implications, risks, threats, and 
applications of the consent simply but in detail, both before sampling and during its use 
in the criminal investigation. Another important requirement is to make sure that volunteers 
do not just accept the terms and conditions but rather understand the consent 
information properly. The third is to clearly explain the process of the use of genomic 
data for criminal investigation and how it eventually affects the final verdict in court. 
Just the presence of DNA at a crime scene does not prove that the person has 
committed the crime. Therefore, the information provided by IGG must not be taken 
as absolute proof of criminal activity. Fourth, guidelines against the use of IGG for 
prejudicial familial searching, genetic surveillance, and other types of discrimination 
must be formulated and enforced. There must be a grievance mechanism for cases 
of genetic privacy breaches. Fifth, in accordance with these guidelines, there must be 
laws pertaining to genetic privacy. IGG is nascent and is still evolving, and it is 
essential for IGG practitioners, investigating agencies, volunteers, and legislative bodies to 
work in concert to bring out the best of IGG. 


Table 26.2 Challenges and solutions to address challenges related to IGG. 

S.No | Challenges | Possible solution 
1. | Technological and phase errors | Use a haplotype score incorporating the phase and genotype error rates 
2. | Presence of DNA mixtures on the crime scene | Separate profiles of the contributors 
3. | Degraded or low-quality samples | Pretreatment to remove contamination 
4. | Need for standards | Formulation of standards 
5. | False identification | Restrict usage of IGG for finding distant relatives 
6. | Use of unproven or invalid testing methods | Standardize the methods and protocols 
7. | Biospecimen and hacking threats | Measures to secure the cloud systems 
8. | Genomic privacy | Regulation to protect genomic privacy 
9. | Informed consent | Formulate guidelines for obtaining consent 
10. | Genetic surveillance | Laws against prejudicial and unauthorized genetic surveillance 
11. | Ethical issues (forced collection of DNA samples from arrested individuals and lack of laws protecting privacy) | Formulate guidelines and laws to address these issues 


Conclusions 


Genetic genealogy has come a long way since its inception and has the potential for even 
higher growth. It has immense application in the field of medical research. Its application 
in the field of criminal investigation is very recent and must be used with great caution. 
Ethical and privacy concerns, prejudicial familial searches, and genetic surveillance are 
some of the key challenges of IGG. It is important that IGG practitioners come together 
to address these issues and formulate the necessary guidelines for the unbiased use of IGG. 


What is next-generation sequencing? 


Next-generation sequencing (NGS) is a relatively novel technology that enables re- 
searchers to sequence large amounts of genetic material quickly and cost-effectively. 
NGS is based on the process of sequencing short fragments of DNA or RNA, which 
are then assembled into a complete sequence of the genome. 
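
The idea of rebuilding a longer sequence from short overlapping fragments can be illustrated with a toy greedy overlap merge. This is a sketch for intuition only: the reads are invented and error-free, the minimum overlap of three bases is an arbitrary assumption, and the function names are hypothetical. Production workflows use dedicated assemblers or, more commonly in forensic applications, align reads to a reference sequence.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is a prefix of `b` (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three made-up reads sampled from the sequence ATGCGTACGTTAGC.
print(greedy_assemble(["ATGCGTAC", "GTACGTTA", "GTTAGC"]))
```

Greedy merging can misassemble repetitive sequence, which is one reason real pipelines rely on more careful graph-based or reference-guided methods.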

Many NGS techniques rely on modified nucleotides, polymerases for synthesis, and 
fluorescence detection, the same idea that underlies Sanger sequencing. Although the 
DNA template must be clonally amplified for some NGS platforms, such as Illumina, 
Life Technologies SOLiD, Ion Torrent Personal Genome Machine, and Roche 454 sys- 
tems, preamplification is not required for the more sensitive HeliScope and Pacific Bio- 
sciences single-molecule real-time systems (Buermans et al., 2014; Stranneheim et al., 
2012). Different NGS platforms use various chemistries for the preparation of libraries 
and sequencing (Hodzic et al., 2017). 

NGS enables scientists and investigators to sequence large 
amounts of genetic material contained in different nucleic acids at once. This would 
otherwise be virtually impossible or impractical to achieve with other sequencing 
technologies. Indeed, researchers have been able to sequence the entire human genome in a 
single experiment and make discoveries that would have been impossible in the past. 

Virtually all downstream analysis and interpretation processes rely on accurate variant 
calling, which is a crucial step in the processing of NGS data. The software tools, tech- 
nologies, and methods for finding sequence variants have advanced significantly during 
the past few years, clearly paralleling the advancement of NGS technologies (Koboldt, 
2020). 
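
To make the notion of variant calling concrete, the sketch below counts the bases observed at each reference position across aligned reads and reports positions where a non-reference base is sufficiently frequent. It is a deliberately naive illustration: the reference, the reads, and both thresholds are invented, reads are assumed to be already aligned without gaps, and base qualities are ignored. It is not the method of any specific variant-calling tool.

```python
from collections import Counter, defaultdict


def call_variants(reference, aligned_reads, min_depth=3, min_alt_fraction=0.2):
    """Return (position, ref_base, alt_base, alt_count, depth) tuples.

    `aligned_reads` is a list of (start_position, read_sequence) pairs,
    with 0-based starts and no insertions or deletions.
    """
    pileup = defaultdict(Counter)
    for start, sequence in aligned_reads:
        for offset, base in enumerate(sequence):
            pileup[start + offset][base] += 1

    calls = []
    for position in sorted(pileup):
        counts = pileup[position]
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to say anything
        ref_base = reference[position]
        for base, count in counts.items():
            if base != ref_base and count / depth >= min_alt_fraction:
                calls.append((position, ref_base, base, count, depth))
    return calls


reference = "ACGTACGT"
reads = [(0, "ACGTAC"), (0, "ACTTAC"), (2, "TTACGT"), (1, "CGTACG")]
print(call_variants(reference, reads))
```

In this toy data, two of the four reads covering position 2 carry T instead of the reference G, so the position is reported as a candidate heterozygous site.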


Next-generation sequencing for forensics 


NGS, despite being a relatively new method, has been quite useful for forensic sci- 
ences and investigations in the past few years. Indeed, it can be used for sequencing 
whole genomes or portions of genomes with a high degree of accuracy (Zhang 
et al., 2019). 
It has been applied to forensic DNA analysis, allowing examiners to generate data that 
spans the human genome and addresses a wider range of questions in a single, targeted 
assay (Zhang et al., 2019). NGS-generated short tandem repeat (STR) calls are 
completely consistent and compatible with current database formats. Additionally, 
NGS technology can be used to identify organ- and developmental stage-specific expres- 
sion as well as miRNA sequences, and some single nucleotide polymorphisms (SNPs) 
may be used to predict the bio-geographical ancestry of unknown crime scene samples 
(Carratto et al., 2022). 
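
One way to see why sequence-based STR calls remain compatible with length-based database entries is that a sequenced repeat region can always be collapsed to a repeat count. The sketch below is a simplified illustration that assumes the flanking sequence has already been trimmed and that the repeat unit is four bases long; the function name and the example sequences are hypothetical. Real nomenclature rules are locus-specific and more involved.

```python
def length_based_allele(repeat_region, motif_length=4):
    """Collapse a sequenced repeat region to a length-based allele label.

    Complete repeats give an integer label (e.g. "10"); leftover bases are
    written after a point (e.g. "9.3" for nine repeats plus three bases).
    """
    full_repeats, leftover_bases = divmod(len(repeat_region), motif_length)
    if leftover_bases:
        return f"{full_repeats}.{leftover_bases}"
    return str(full_repeats)


# Two alleles that differ in sequence but not in length.
uniform = "TCTA" * 10
interrupted = "TCTA" * 8 + "TCTG" * 2
# An allele with an incomplete repeat unit in the middle.
partial = "AATG" * 6 + "ATG" + "AATG" * 3

for region in (uniform, interrupted, partial):
    print(length_based_allele(region))
```

The first two alleles receive the same length-based label even though NGS can tell them apart, which is the additional discriminating information that sequencing provides on top of a conventional profile.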

The rapid emergence of NGS techniques, also sometimes 
known as massively parallel sequencing methodologies, has revolutionized many 
genomic fields, including forensics. In 2005, Roche 454 introduced the first 
high-throughput sequencing platform based on the pyrosequencing technique. 

Since then, NGS methodology has undergone significant enhancements and im- 
provements, with decreased costs and analysis time and increased sensitivity. Robust tools 
and instruments have been developed and are currently commercially available. 
Although standard DNA analyses in forensic laboratories involve STR profiling using 
capillary electrophoresis (CE), this methodology has certain limitations, including its 
inability to distinguish monozygotic twins or work with low quantities of samples. 
Low-quality samples can also hinder the accuracy of the analysis (Carratto et al., 
2022). Additionally, the CE methodology has a lower multiplex capability. Thus, 
new approaches are necessary to extract more information from DNA to aid 
forensic casework. 

NGS presents several advantages and improvements over conventional CE analyses, 
including the capability of analyzing thousands of markers of various types at once, 
providing different information in a single analysis. NGS can also better handle the 
low-quantity and degraded samples that are common in forensic investigations. There- 
fore, NGS is deemed to be a powerful tool for deeper DNA analysis in forensic genetics, 
leading to its steady introduction and widespread use as a standard method in forensic 
laboratories. 

In a recent survey from 2019 of 105 European forensic laboratories, 46% of partici- 
pants already owned NGS equipment and consumables, while 26.7% planned to acquire 
them within the next 2 years (Gross et al., 2021). Clearly, NGS analysis can empower the 
study of polymorphisms, opening new horizons and opportunities for forensic genetics 
and molecular biology. 

Moreover, in situations like mass catastrophes, man-made disasters, terrorist attacks, or 
other events where forensic specimens and samples are contaminated and degraded, NGS 
technologies will be essential for human DNA analysis and typing. NGS will make it 
feasible to analyze the usual autosomal DNA (STRs and SNPs), mitochondrial DNA, 
and X and Y chromosomal markers all at once and yield robust results (Alvarez- 
Cubero et al., 2017). 


The role of artificial intelligence and machine learning in NGS 


What is artificial intelligence? Why is it useful for forensics? 


Artificial intelligence (AI) is revolutionizing many industries, including the fields of 
forensic omics, genomics, and NGS. AI refers to the development of computerized sys- 
tems and solutions that can perform tasks that normally require human intelligence, such 
as perception, reasoning, learning, and decision-making. The field of AI encompasses a 
range of techniques, including machine learning (ML), natural language processing,
computer vision, and robotics, among others. AI has the potential to transform many
fields and industries, from healthcare and finance to transportation and education.

One of the earliest and most influential definitions of AI was proposed by John 
McCarthy, who described it as “the science and engineering of making intelligent machi- 
nes” (McCarthy, 1956). Since then, the field has evolved and expanded significantly, with 
many researchers and practitioners offering their own definitions and perspectives. For 
instance, Russell and Norvig (2010) define AI as “the study of agents that receive
perceptions from the environment and take actions that affect that environment.”

While AI has made significant progress in recent years, it is still a rapidly evolving field 
with many challenges and opportunities. Some of the key research areas in AI today 
include deep learning, ML, reinforcement learning, explainable AI, and the ethical and 
societal implications of AI. This multifaceted field sees frequent new developments and
accelerating breakthroughs. In recent years, there have been several notable advancements
in AI, including in natural language processing, computer vision, ML, and deep learning.
All of these areas can be very useful for biomedical sciences, forensics, and genomics.

Despite these impressive developments, AI still faces many challenges, particularly in 
areas such as explainability, interpretability, fairness, and ethics. Explainability is especially 
crucial for users interacting with AI to recognize and fathom the AI’s interpretations, 
conclusions, and recommendations. As AI continues to evolve, it will be important for 
researchers and practitioners to address these challenges and ensure that AI is used in a 
responsible and beneficial way. This is even more central for forensic studies and case- 
work, especially from an ethical point of view. 

AI is increasingly being used in biomedical sciences and forensics to improve our
understanding of diseases and enhance forensic investigations. In biomedical sciences, AI has
been applied to various tasks, including drug discovery, medical imaging analysis, and
precision medicine (Gómez-Rios et al., 2021; Krittanawong et al., 2021). AI has also
been used in precision medicine to develop personalized treatment plans based on an in- 
dividual’s genetic and medical history (Wang & Preininger, 2019). 

In forensics, AI has been used to analyze various types of evidence, including DNA, 
fingerprints, and surveillance footage. For example, AI can be used to analyze DNA se- 
quences to identify suspects or determine familial relationships. Facial recognition 
software, which uses AI algorithms to match images to a database of known individuals,
has also been used in forensic investigations (Jain et al., 2011). AI can also be used to 
analyze surveillance footage to identify suspects or track their movements. 

Most importantly, AI has been used in forensic genomics for various purposes, such as 
identifying suspects, predicting the physical traits of suspects from DNA samples, and 
identifying the tissue source of forensic samples. The use of AI in forensic genomics 
can be traced back to the 1990s, when early primitive ML algorithms were first applied 
to DNA sequencing data (Baldi & Brunak, 2001). 

In recent years, AI has been used for more advanced applications in forensic genomics. 
For example, in 2019, a team of researchers used a deep learning algorithm to predict the 
physical appearance of suspects from DNA samples, including eye, skin, and hair color 
(Chaitanya et al., 2019). In another study, researchers used AI to identify the tissue source 
of forensic samples based on gene expression patterns, which could aid in the identifica- 
tion of missing persons or unidentified human remains (Korsak et al., 2019). 

Despite the potential benefits of using AI in forensic genomics, there are also concerns 
about the reliability of AI algorithms, particularly with regard to bias and accuracy. In 
addition, there are ethical considerations regarding the use of AI in forensic investigations,
particularly with regard to the privacy and potential misuse of data (Bennett, 2020). 

Taken together, these points show that AI has a long history of use in
forensic genomics, and recent advances in deep learning algorithms have enabled more 
advanced applications in this field. However, it is important to carefully consider the 
moral, ethical, and societal implications of AI in forensic investigations and to ensure 
that its use is in line with legal and ethical requirements and standards. 

In addition, despite acknowledging the potential benefits of using AI in biomedical 
sciences and forensics, there are also concerns about privacy, bias, and the reliability of 
AI algorithms. It will be important for researchers, investigators, and practitioners to 
address these concerns and ensure that AI is used in a responsible and ethical way. 


What is machine learning? 


ML is a subfield of AI that enables computer systems to routinely and automatically learn 
and improve from experience without being explicitly programmed. In biomedical sci- 
ences, forensics, and genomics, ML has the potential to revolutionize the field by 
enabling faster and more accurate analysis of large amounts of generated data. 

In biomedical sciences and forensics, ML has already been used for tasks such as
medical image analysis, diagnosis of diseases, and drug discovery. For example, ML algorithms
have been applied to medical imaging data to automatically detect abnormalities and assist 
with disease diagnosis (Litjens et al., 2017). ML has also been used to predict patient out- 
comes and develop personalized treatment plans based on a patient’s genetic and medical 
history (Miotto et al., 2017). 




In forensics, ML has been used to analyze various types of evidence and data,
including DNA, fingerprints, and surveillance footage. For example, ML algorithms can 
be used to analyze DNA sequences to identify suspects or determine familial relationships 
(Li et al., 2019). ML can also be used to analyze patterns in surveillance footage to identify 
suspects or track their movements. 

Despite the potential benefits of using ML in biomedical sciences, forensics, and
genomics, there are also concerns similar to the ones raised about the overall use of AI in
these fields. Indeed, matters like bioethics, privacy, bias, and the reliability of ML algo- 
rithms should be addressed. It will be important for researchers and experts to focus on 
these concerns, propose solutions to them, and ensure that ML is used in a responsible 
and ethical way. 


How do artificial intelligence and machine learning help with NGS? 


NGS generates vast amounts of data, making it challenging to analyze and interpret the 
results precisely. AI can help solve this problem by providing advanced algorithms and 
ML techniques to process and analyze NGS data quickly and accurately. Indeed, AI 
and ML are already used to improve the accuracy, reliability, efficiency, and speed of 
NGS. AI can be used to identify patterns in the data generated by NGS, which can 
help investigators and researchers interpret the results more quickly and accurately. 

Another primary application of AI in NGS is data preprocessing, which involves qual- 
ity control, adapter trimming, and filtering of low-quality reads. AI algorithms can auto- 
matically detect and remove sequencing errors and low-quality reads, which can 
significantly improve the accuracy and quality of the results. In other words, AI can
identify and flag many types of errors or abnormalities in the datasets, which can
substantially reduce the risk of inaccuracies in the final results. This clearly
strengthens the power, robustness, and consistency of NGS technologies (Behjati &
Tarpey, 2013; Kiran & Akhade, 2021; Sivakumar et al., 2021; Xu et al., 2021; Zhang et al.,
2011).
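
The preprocessing steps named above (adapter trimming and filtering of low-quality reads) can be illustrated with a minimal Python sketch. This is a simple rule-based baseline, not an AI method and not any specific tool's implementation; the adapter sequence, the thresholds, and all function names are assumptions chosen for illustration.

```python
# Minimal sketch of NGS read preprocessing: adapter trimming followed by
# quality filtering. The adapter and thresholds below are illustrative
# assumptions, not values taken from any particular kit or pipeline.

ADAPTER = "AGATCGGAAGAGC"   # assumed adapter sequence (hypothetical choice)
MIN_MEAN_QUALITY = 20       # assumed mean Phred cutoff
MIN_LENGTH = 10             # assumed minimum read length after trimming


def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 encoding) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def trim_adapter(sequence, quality_string, adapter=ADAPTER):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    position = sequence.find(adapter)
    if position == -1:
        return sequence, quality_string
    return sequence[:position], quality_string[:position]


def passes_quality(quality_string, min_mean=MIN_MEAN_QUALITY):
    """Return True when the mean Phred score meets the cutoff."""
    scores = phred_scores(quality_string)
    return bool(scores) and sum(scores) / len(scores) >= min_mean


def preprocess(reads):
    """Trim adapters, then keep reads that are long enough and of good quality."""
    kept = []
    for name, sequence, quality in reads:
        sequence, quality = trim_adapter(sequence, quality)
        if len(sequence) >= MIN_LENGTH and passes_quality(quality):
            kept.append((name, sequence, quality))
    return kept


# Hypothetical reads: (name, bases, quality string)
reads = [
    ("read1", "ACGTACGTACGTACGT" + ADAPTER, "I" * 29),  # good read with adapter
    ("read2", "ACGTACGTACGTACGT", "#" * 16),            # low quality throughout
    ("read3", "ACGT" + ADAPTER, "I" * 17),              # too short after trimming
]
kept = preprocess(reads)
print([name for name, _, _ in kept])
```

An ML-based approach would replace the fixed cutoffs with a model trained to recognize error patterns, but the input (reads with per-base quality scores) and the output (a filtered read set) would have the same shape.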

AI and ML are also being used to improve data analysis in NGS. AI-based systems and
ML algorithms can be used to identify patterns, variances, and anomalies, such as rare ge- 
netic variants or gene expression patterns, in large datasets generated by NGS. This will 
help researchers and investigators comprehend and interpret the results and identify 
important insights. ML algorithms can also be used to predict how certain patterns can 
be affected by certain elements, like DNA degradation. In addition, AI and ML can 
be used to detect genetic mutations, genetic variability, or diseases, which could lead 
to the development of new human identification tests. 

Another important aspect is that AI and ML can be used to transform NGS method- 
ologies and techniques, enabling researchers to generate more reliable results in even 
shorter amounts of time. Furthermore, ML algorithms can be used to build mathematical 
models and simulations that can calculate and foresee the expected outcomes of any NGS
analysis based on input training data. A similar prediction approach, relying on a deep
learning model, was successfully developed to predict NGS results very accurately
(Zhang et al., 2021). These tools can be used to verify and validate
NGS experimental results and outcomes. 

Given the recognized influence of nucleic acid sequences on form and function and 
the exponential number of possible sequences of given lengths, the integration of gener- 
alizable standardized domain knowledge into deep learning architectures will be a note- 
worthy facilitator for predicting behaviors for nucleic acids in general (Zhang et al., 
2021). 


Challenges of utilizing AI and ML in NGS


While it is obvious that AI and ML can be used to improve the precision, robustness, and 
speed of NGS, there are also some challenges associated with using these technologies. 
These include the need for high-quality data, the complexity of the algorithms used, 
and the cost of implementation. Another major challenge is the lack of standardization 
in NGS data. Different sequencing platforms combined with different protocols
can generate data of varying quality, making it challenging to compare and analyze
data from different sources and making analyses difficult to reproduce.

Another challenge is the need for large amounts of high-quality data to train ML al- 
gorithms. Access to such data can be limited, particularly for rare diseases or disorders. In 
addition, the data generated by NGS can be complex and difficult to input and interpret, 
which can limit the effectiveness of AI and ML algorithms. Furthermore, many of the AI
and ML algorithms used in NGS are still in the early stages of development, so it may be
some time before they are completely ready for widespread use. However, with the
current rate of innovation and development of new techniques, algorithms, and AI-based
systems and tools, this lag may be shorter than expected.


Conclusion 


NGS is doubtlessly a revolutionary technology that has enabled researchers to sequence 
large amounts of genetic material and nucleic acids from different types of samples and 
tissues quickly and cost-effectively. AI and ML have the potential to augment and 
enhance NGS methods and techniques and to uncover new insights from the data 
they generate. However, there are also some challenges associated with utilizing AI 
and ML in NGS, including the need for high-quality data and the complexity of the pro- 
cedures and systems used. With further research and development, AI and ML technol- 
ogies could be used to greatly improve the accuracy and speed of NGS and open new 
possibilities in the field of forensics, as they are doing in other fields. We can conclude
that the application of AI and ML in NGS is still in its early stages, and there are many 
challenges to overcome. However, with continued research and development, AI can 
help unlock the full potential of NGS for forensics, personalized medicine, genetic ana- 
lyses, and many other areas and fields. 


Introduction to forensic DNA analysis


Forensic identification using DNA typing has raised ethical concerns since it was intro- 
duced more than 35 years ago. The concerns include genetic privacy, consent, and using 
DNA to make predictions about a person’s physical appearance (Elkins, 2013; Schwab
et al., 2018; Wee & Mozur, 2021). Additional concerns center on racial disparities in
databases and the potential for familial DNA searches to generate leads that tie innocent
people to crimes (Wickenheiser, 2019). While the earliest forensic DNA typing focused on
comparisons of repeat counts at autosomal DNA short tandem repeat (STR) loci, modern
forensic DNA typing methods analyze autosomal (a), X chromosome, and Y chromo- 
some STR repeats via sequencing as well as single nucleotide polymorphism (SNP) ge- 
notypes and mitochondrial (mt) DNA haplotype variations (Jager et al., 2017). After the 
initial genetic loci were chosen for analysis, they were sequenced, genotypes were docu- 
mented, and population variations were recorded. At the time they were adopted, most 
STR loci were considered “junk” or noncoding DNA (Ohno, 1972). 

Nowadays, it is known that many of the chosen STR and SNP forensic loci are in 
gene-coding regions and code for phenotype polymorphisms (Arneson et al., 2018; 
Banuelos et al., 2021; Banuelos et al., 2022; Pombar-Gomez et al., 2015; Shademan 
et al., 2021; Wyner et al., 2020). Indeed, many phenotype polymorphisms are explicitly 
typed in the newest forensic DNA typing kits. Forensic SNPs are classified into several 
types, including ancestry-informative SNPs (aiSNPs), phenotype-informative SNPs 
(piSNPs), and identity-informative SNPs (iiSNPs). Forensic DNA typing is used not only
for identity typing but also for investigative leads and evidence analysis in criminal
cases, including phenotype predictions, biogeographical ancestry predictions,
monozygotic twin differentiation, paternity and maternity lineage typing, missing persons
cases, and cold cases (Elkins & Zeller, 2021; Weber-Lehmann et al., 2014) (Table 28.1).
Additionally, whereas early forensic DNA typing was limited to a few loci, panels of hundreds
or even thousands of loci are now easily analyzed with forensic DNA typing kits using 
next-generation sequencing (NGS). NGS data has been approved for upload to the
Combined DNA Index System (CODIS) United States national DNA database since 2019
(Elkins & Zeller, 2021). Moreover, NGS DNA typing products are more specialized and
yield more individualizing data, but they also require more analyst time and energy than
ever before, as well as new data storage drives and mechanisms. Furthermore, whole
genome sequencing (WGS) of an individual can be performed; WGS leads to the
elucidation of the full genetic code for an organism or person. WGS can be invaluable when
other DNA-typing approaches fail.

Table 28.1 Types of forensic DNA typing and their applications.

Loci                   Case applications

aSTRs                  Identity, paternity, missing persons, cold cases
Y-STRs                 Paternity, male lineage, missing persons, cold cases
X-STRs                 Maternity, female lineage, missing persons, cold cases
mtDNA                  Female lineage, missing persons, cold cases
iiSNPs                 Identity, missing persons, cold cases
aiSNPs                 Missing persons, cold cases, investigative leads
piSNPs                 Missing persons, cold cases, investigative leads
WGS or targeted NGS    Identity (including identical twins), missing persons, cold
                       cases, historic cases, investigative leads

Instead of being limited to comparing STR repeats, forensic scientists employing 
NGS can compare sequence variation and facilitate genetic identification. The STR 
target loci have been determined to be more polymorphic than previously known 
through the identification of isoalleles, alleles of the same length but different sequences, 
many times differing by a single nucleotide base (Berglund et al., 2011; Gettings et al., 
2017). NGS is facilitating the detection of mt heteroplasmy as well. Heteroplasmy and 
isoallelic variation can aid in DNA mixture deconvolution. 
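
The distinction between length-based and sequence-based allele comparison can be illustrated with a short Python sketch. The two allele sequences are hypothetical examples built from a four-base repeat, not alleles from any real locus, and the function names are assumptions made for illustration.

```python
# Sketch: why sequencing can separate alleles that look identical by length.
# Both allele sequences below are hypothetical illustrations.

def length_based_match(allele_a, allele_b):
    """CE-style comparison: alleles match when their lengths agree."""
    return len(allele_a) == len(allele_b)


def sequence_based_match(allele_a, allele_b):
    """NGS-style comparison: alleles match only when the sequences agree."""
    return allele_a == allele_b


def is_isoallele_pair(allele_a, allele_b):
    """Isoalleles: same length, different sequence."""
    return length_based_match(allele_a, allele_b) and not sequence_based_match(
        allele_a, allele_b
    )


# Ten repeats of the same unit.
allele_1 = "TCTA" * 10
# Same total length, but the fifth repeat differs by a single base.
allele_2 = "TCTA" * 4 + "TCTG" + "TCTA" * 5

print(length_based_match(allele_1, allele_2))    # same length
print(sequence_based_match(allele_1, allele_2))  # different sequence
print(is_isoallele_pair(allele_1, allele_2))
```

A length-based method would report both alleles as the same allele, whereas a sequence-based method distinguishes them, which is the additional information that can help separate contributors in a mixture.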


Ethics and forensic DNA typing 


Since forensic DNA typing was introduced, bioethicists and citizens have raised 
numerous ethical and legal concerns (Elkins, 2013). Ethics can be defined as professional 
rules, norms, codes of conduct, and standards in a profession. The term law refers to a set 
of legal rules to regulate the actions of the members of a community. Ethics and laws 
should not be confused with morals, which are cultural norms and values governing 
behavior. Unintentional mistakes or instrument failures are not considered breaches of 
ethics. Herein, we will focus on ethical concerns of DNA typing. 

Ethical concerns surrounding forensic DNA analysis include retaining DNA profiles of
individuals suspected of crimes in forensic databases after they are no longer suspects,
keeping databases of missing persons separate from crime scene samples, obtaining
consent for collecting and testing samples, using DNA typing only for crimes that are
subject to forensic DNA typing, and expunging arrestee profiles uploaded to forensic
databases if the persons are exonerated from the crime(s) (Elkins, 2013). Drawbacks of
forensic STR DNA typing have been raised, including matching and arresting the wrong
twin in monozygotic twin pairs. Additional concerns, such as whether DNA phenotyping
and ancestry predictions should be made, are more relevant than ever with the
application of NGS. Several studies now point to forensic DNA kits uncovering disease
phenotypes and the propensity for individuals to develop medical conditions later in 
life (Banuelos et al., 2022). Other ethical issues raised with genotyping and sequencing 
applications include the reidentification of donors used for clinical or forensic population 
studies (Schwab et al., 2018; Wjst, 2010; Yang, 2019). Concerns with familial DNA anal- 
ysis and the use of genetic genealogy have also been raised (Mateen et al., 2021). Ethical 
concerns raised include family members being used to lead law enforcement to their rel- 
atives to solve a crime without their consent and other privacy concerns (Court, 2018). 

Understanding the differences between ethics, laws, and morals in determining the 
appropriate use and communication of DNA typing results requires education and prac- 
tice that often extends beyond the university curriculum in chemistry and biology (Elkins 
& Fambegbe, 2020). Role-playing activities may help prospective law enforcement of- 
ficers and forensic lab staff experience and consider different viewpoints for a case. For 
example, to engage with the content psychologically and understand concerns, a law 
enforcement officer or scientist may play the role of an officer going to a home or work- 
place to collect a DNA sample under a warrant in a genetic genealogy case, and another 
may play the role of the relative. 


How does forensic genetic genealogy differ from familial DNA 
searches? 


A familial DNA search uses STR typing data in government forensic databases to
associate crime scene DNA with biologically related individuals.
Software exists that searches databases for individuals that share alleles with
the suspect contributor profile and identifies biologically related parents, siblings, and 
children. X- and Y-STRs are used to narrow the list to individuals that share paternal 
lineage. Familial DNA searching can identify an unknown offender and is regulated to 
be used in certain circumstances, but not as a routine step for criminal investigations. 
The names of close biological relatives are provided to investigators for further investiga- 
tion. Concerns have been raised that the searches use relatives to identify the persons of 
interest in a criminal case (Katsanis, 2020). While this technique is banned in Maryland 
(our state) as of this writing, it is allowed in other states, including California, Colorado, 
New York, and Florida. In Montana, partial familial DNA searches are banned unless 
probable cause is found and a search warrant is issued. 
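
The idea of searching a database for profiles that share alleles with an evidence profile can be sketched in Python as follows. This is a deliberately simplified allele-sharing count and does not represent the method used by any actual search software, which relies on more rigorous kinship statistics. The locus names are real STR loci, but every genotype, candidate name, and function name here is a hypothetical assumption.

```python
from collections import Counter

# Sketch: rank database profiles by how many alleles they share with an
# evidence profile. All genotypes below are hypothetical.


def shared_alleles(genotype_a, genotype_b):
    """Count alleles shared at one locus (0, 1, or 2), respecting duplicates."""
    overlap = Counter(genotype_a) & Counter(genotype_b)
    return sum(overlap.values())


def sharing_summary(profile_a, profile_b):
    """Summarize allele sharing over the loci typed in both profiles."""
    common_loci = sorted(set(profile_a) & set(profile_b))
    per_locus = [
        shared_alleles(profile_a[locus], profile_b[locus]) for locus in common_loci
    ]
    return {
        "total_shared": sum(per_locus),
        "loci_compared": len(per_locus),
        # A parent and child are expected to share at least one allele at
        # every locus, barring mutation.
        "shares_at_every_locus": all(count >= 1 for count in per_locus),
    }


def rank_candidates(evidence, database):
    """Sort database entries from most to fewest shared alleles."""
    scored = [
        (name, sharing_summary(evidence, profile))
        for name, profile in database.items()
    ]
    return sorted(scored, key=lambda item: item[1]["total_shared"], reverse=True)


evidence = {"D8S1179": (12, 14), "D21S11": (29, 30), "TH01": (6, 9.3)}
database = {
    "candidate_A": {"D8S1179": (12, 13), "D21S11": (30, 31), "TH01": (9.3, 9.3)},
    "candidate_B": {"D8S1179": (10, 11), "D21S11": (29, 29), "TH01": (7, 8)},
}

ranking = rank_candidates(evidence, database)
for name, summary in ranking:
    print(name, summary)
```

In this toy data, candidate_A shares one allele at each locus and would be ranked first as a possible close relative, while candidate_B shares an allele at only one locus.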

Most recently, NGS DNA profiles are being used for genetic genealogy, the applica- 
tion of DNA testing to trace/locate individuals through their genetic family members us- 
ing public, commercial genetic genealogy databases in cases for which National DNA 
Index System (NDIS) searches did not identify viable suspects (Wickenheiser, 2019). 
In investigative genetic genealogy (IGG), law enforcement officials upload forensic 
DNA SNP data from ForenSeq Kintelligence, or a similar commercial product, to public 
genealogy databases, including GEDmatch, and compare them to profiles from Ancestry,
MyHeritage, and 23andMe in order to scan the profiles of individuals who have opted
in for law enforcement searches. The search can generate leads to genetic relatives,
including distant cousins, across multiple generations. Genetic genealogists generate
family trees to identify modern descendants. This technique is also used in unique cases, such
as murders, to identify serial killers and rapists.

On September 24, 2019, the US Department of Justice announced its interim policy 
on forensic genetic genealogical DNA analysis and searching (FGGS) with an effective 
date of November 1, 2019. Support statements were issued by the American Society 
of Crime Lab Directors (ASCLD), the Federal Bureau of Investigation (FBI), and the 
Consortium of Forensic Science Organizations. 

On May 30, 2021, Maryland legislators passed a law regarding the use of IGG 
balancing public safety and the personal right to privacy that went into effect on October 
1, 2021. Montana has passed similar restrictions. A search warrant is required to perform 
the search after the court issues a finding of probable cause. In Montana, public consumer 
databases cannot be searched unless the individual waives their right to privacy. 

Maryland restricted the use of IGG to serious violent crimes, including committed or 
attempted murder, rape, a felony sexual offense, or a criminal act affecting public safety or 
national security (Ram et al., 2021; Taylor, 2021). In Maryland, the forensic sample must 
be biological material that investigators believe was deposited by a perpetrator and was 
collected from a crime scene, a person, an item, or a location connected to the criminal 
event, or the unidentified human remains of a suspected homicide victim. Traditional 
DNA typing, used to generate an STR profile from the forensic evidence and used in 
a CODIS search, must have failed to identify a known individual. Only public data 
from a direct-to-consumer (DTC) or public database that explicitly allows law enforce- 
ment to use it to investigate crimes should be used for the IGG search (Skeva et al., 2020). 

Current concerns include the use of FGGS and IGG for lead generation, with traditional
investigation methods then used to follow up on or confirm a lead, especially when the
use of FGGS or IGG is concealed in case notes, skirting legal and public concerns about its use.


Misuse of tools indicated for forensic use? 


Off-label uses of products are common. For example, doctors can prescribe medicines 
and medical devices for off-label use if they have approved medical use for another con- 
dition. This has led to lives being saved. Parts and products can be used for purposes 
beyond the originally intended purpose. Similarly, IGG uses databases built by genealo- 
gists for law enforcement purposes. 

In forensic DNA, products have been carefully tested and labeled with quality control 
seals and lot numbers. These labels are a means to convey that the products are reliable, 
accurate, and guaranteed for their intended forensic use. However, certified or accredited 
forensic labs are not the only purchasers or users of these products. Academic and 
government standards labs also buy forensic DNA products. Because forensic science is
the application of science to law, the independent evaluation and validation of products 
by labs, both academic and forensic, and subsequent publication is important for admis- 
sion to the court of law. However, off-label use is reportedly causing concern among 
companies. As of this writing, unlike for controlled substances, there are no restrictions 
or permissions needed to purchase forensic DNA products, including DNA extraction, 
quantitation, and DNA typing kits labeled for forensic use. 

One application of off-label use is to develop population databases using various 
forensic kits to generate allele frequency databases for statistical calculations of genotypes. 
Standards labs have sequenced and analyzed forensic kit amplicons to independently 
check the products for their reliability in generating data and in producing accurate allele 
calls. STR kits targeting the same loci have been compared and assessed to ensure that 
they generate the same genotypes for a set of research-type forensic samples. Labs have 
validated the kits in their environment and workflow and determined threshold settings 
for stutter and limit of detection. From these studies, primer binding site mutations, non- 
concordant genotypes at loci, loci with a high stutter, sequence variations, and additional 
STR alleles have been reported and published (Elkins & Zeller, 2021). 
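
How an allele frequency database feeds a statistical genotype calculation can be shown with a small Python sketch. It assumes Hardy-Weinberg proportions at a single locus and uses a tiny hypothetical population sample; real casework calculations use much larger databases and typically apply further corrections that are not shown here. The function names are assumptions made for illustration.

```python
from collections import Counter

# Sketch: estimate allele frequencies from a (hypothetical) population sample
# and compute an expected genotype frequency under Hardy-Weinberg proportions.


def allele_frequencies(genotypes):
    """Count every allele in the sample and convert counts to frequencies."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: count / total for allele, count in counts.items()}


def genotype_frequency(genotype, frequencies):
    """Expected frequency: p^2 for homozygotes, 2pq for heterozygotes."""
    allele_a, allele_b = genotype
    p, q = frequencies[allele_a], frequencies[allele_b]
    return p * p if allele_a == allele_b else 2 * p * q


# Hypothetical genotypes at one STR locus for five individuals.
population_sample = [(6, 7), (6, 6), (7, 9), (6, 9), (7, 7)]

frequencies = allele_frequencies(population_sample)
print(frequencies)
print(genotype_frequency((6, 7), frequencies))  # heterozygote: 2pq
print(genotype_frequency((6, 6), frequencies))  # homozygote: p^2
```

With ten alleles in the sample, alleles 6 and 7 each have a frequency of 0.4 and allele 9 has a frequency of 0.2, so the heterozygous genotype (6, 7) has an expected frequency of 2 × 0.4 × 0.4 = 0.32.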

A second off-label use is the reanalysis of NGS data collected in support of law 
enforcement or medical diagnosis activities, such as the early detection of cancer. Rean- 
alysis of the data by insurance companies could lead to the discovery of other genetic dis- 
eases or attributes and be used to deny patients insurance coverage in the future. 

Separate from these research purposes, state labs in China have been accused of off- 
label use of forensic DNA typing kits to profile populations or ethnic groups (FTM, 
2021). The groups have been reported to use DNA phenotype data from NGS to 
map human faces and profile or surveil the Uighur minority population (Wee & Mozur, 
2021). Companies are considering additional purchasing restrictions on their products to 
avoid future misuse. 


Personal genomics companies and consumer data in the forensic 
setting 


Personal genomic companies offer DTC SNP panel DNA testing. The panels have been 
shown to be associated with various diseases and disease risk factors as well as personal 
characteristics. As these are uncovered, they are shared with consumers. Companies 
such as 23andMe, MyHeritage, Family Tree DNA, and AncestryDNA offer DNA typing 
of large SNP panels for ancestry predictions and genetic genealogy purposes. Other com- 
panies developed for the Russian, Chinese, Brazilian, and East Asian markets, such as 
GenoTek, Yoogene, Genera, and WeGene, respectively, predict talent, health, sport, 
and fitness traits. These companies have normalized genetic testing and seek to normalize 
the use of DNA for predictive health, but not forensic, purposes. However, they lack the 
rigorous, transparent, independent, and validated testing of their products and analysis of 
their data that the public expects from forensic laboratories. Additionally, their owners
have other planned uses for the data. For example, 23andMe has received large sums 
from the pharmaceutical company GlaxoSmithKline (GSK) to use the data in drug 
development, and GSK bought a 300 million USD stake in the company in 2018. 
Another example is how Blackstone, the $63 billion private equity giant, acquired 
Ancestry.com in 2020, which included individual consumer DNA samples as well as
the databases that Ancestry.com has built for customers to track their family tree. 

Consumers place trust in their data and findings and are using their data, such as that 
for BRCA gene mutations, to make life-changing medical decisions, such as a preemptive
mastectomy, when disease risk factors are identified. However, in some cases, this trust 
has been unfounded. Data has been found to be unreliable, wrong, or incorrectly inter- 
preted. The tests are not designed to give accurate results for rare mutations. In the begin- 
ning, datasets were small as well, although they continue to grow. 

Personal genomics companies also analyze SNPs from genes or gene regions that are 
associated with physical traits or personality characteristics. Some of these, including 
polymorphism predictors for hair color, eye color, skin tone, face shape, BMI, age, 
and height, are useful for forensic purposes (Elkins et al., 2023). The sequences need 
to be accurate in order to generate an accurate prediction. The predictions from com- 
mercial products developed for forensic applications are much more conservative. For 
example, forensic ancestry predictions using the ForenSeq kit are continental rather 
than regional by country, state, city, or town (Jager et al., 2017). Facial prediction remains 
challenging, but web-based software such as MetaHuman generates images based on 
inputted characteristics (Elkins et al., 2023). Since full profiles are not always obtained, 
forensic genetic genealogists have begun imputing the sequences at SNPs based on the 
genotypes at nearby SNPs (microhaps). Their justification is that the prediction is used 
only for lead generation and will be confirmed with additional methods. 


Forensic NGS and genetic genealogy 


The application of DTC NGS data to genealogy to help solve crimes using IGG (Wick- 
enheiser, 2019) is solving more cold cases faster than ever before. NGS was used to 
analyze a minor contributor in a sexual assault case (de Knijff, 2020) and identify Joseph 
James DeAngelo (no relation to the author) as the Golden State Killer in 2018 (Selk, 
2018). Since then, hundreds of cold cases have been solved using IGG. IGG can be
used not only to identify perpetrators but also to rule out suspects.

Privacy considerations are important for the application of DTC tools to cases (Bononi 
et al., 2020). Initially, GEDmatch required users to opt out of law enforcement searches. 
However, in 2020, they changed their policies to offer users the opportunity to opt into 
these searches rather than make them the default setting. Even with the changes, many 
people agreed to their data being used for investigative searches because of an innate desire 
to “do good.” Forensic genetic genealogists are committed to adhering to their guidelines 
and terms of service, following state and federal laws, and adhering to SWGDAM’s IGG 
recommendations. The forensic raw DNA data for the suspect is used to search the 
database. GEDmatch does not reveal the raw data for chromosomal region matches in the 
database (Guerrini et al., 2021). After the database generates the matches, forensic genetic 
genealogists compare the results to non-DNA data, including social media and geographical 
location. Still, some consumers did not fully understand what they were agreeing to 
when opting into these searches. While the use of the GEDmatch database for investigative 
searches does not require contact with most individuals, law enforcement obtains a 
warrant to get information when it does need to make contact; the names can then be 
published in the news, and the person(s) may be asked to testify in court. Likewise, if a 
sample is requested from a suspect on the basis of probable cause, a court can decide 
whether the request is reasonable. If a sample is retrieved from the trash, no warrant is required. 

The use of these database searches for forensic investigative leads has caused concern 
among the general public that their DNA may be used to connect them with past or future 
major crimes and has led to rape victims refusing to engage with law enforcement 
(Wired, 2022). Many argue that there should be limitations on law enforcement’s use 
of genetic genealogy databases to solve crimes (Guerrini et al., 2018). Most DTC companies 
do not delineate the process that law enforcement agencies should use to request the 
use of their services (Skeva et al., 2020). Those who are 
innocent and not related to a crime could become the center of an active investigation 
based on a match from a database search. This has led to public distrust not 
just of the companies providing DTC NGS and genetic genealogy but of law enforcement 
and forensic investigators as well. 

There are ethical concerns about NGS and genetic genealogy being used for surveillance 
and about ideas for developing a population database inclusive of everyone in 
the population. Additionally, there is concern regarding which cases law enforcement 
should be pursuing to keep the population safe. Genetic genealogy has been used to solve 
old cases involving the identification of abandoned fetal human remains (Medina, 2022). While 
the circumstances of the cases (for example, whether the fetus took a breath) are not clear 
from the skeletal remains alone, women are being arrested, and their cases are being 
brought to court to decide whether they buried a miscarried fetus or killed or abandoned a 
baby years or decades earlier. 


Next-generation sequencing and environmental impact 


NGS, especially WGS, produces gigantic datasets. Data handling and storage are two of 
the significant problems challenging forensic labs seeking to implement NGS into their 
workflow (Alonso et al., 2017). Storing and backing up the data requires large servers or 
data centers. Manipulation and analysis of the data require computational cores, web-based 
software, or computational tools. Servers and computer cores are energetically 
expensive. Although some are entirely powered by wind and solar farms, others are powered 
by fossil fuels. Moreover, powering servers uses energy resources that could be allocated 
to other uses. Google data centers use more power than some entire countries (Forbes, 
2021). Applying NGS technology to all criminal cases would entail a 
substantial national increase in energy usage. 
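
To give a sense of scale for the storage problem, the following back-of-envelope sketch estimates raw read data volume for WGS. Every default value is an assumption made for illustration and is not taken from the chapter: a genome of roughly 3.1 billion base pairs, 30x coverage, about two bytes per sequenced base in uncompressed FASTQ (one for the base call, one for its quality score, ignoring read headers), and two stored copies (primary plus backup). Compression would reduce these figures considerably.

```python
HUMAN_GENOME_BP = 3_100_000_000  # assumed approximate haploid genome size


def raw_fastq_bytes(genome_bp=HUMAN_GENOME_BP, coverage=30, bytes_per_base=2):
    """Approximate uncompressed FASTQ size for one sample.

    Each sequenced base is assumed to cost one byte for the base call
    and one for its quality score; read headers are ignored.
    """
    return genome_bp * coverage * bytes_per_base


def archive_terabytes(n_samples, copies=2, **kwargs):
    """Approximate storage, in terabytes, for `n_samples` samples
    kept in `copies` locations (for example, primary plus backup)."""
    total_bytes = n_samples * copies * raw_fastq_bytes(**kwargs)
    return total_bytes / 1e12
```

Under these assumptions a single 30x genome comes to roughly 186 GB of raw reads, and a thousand samples with one backup copy to a few hundred terabytes, before any alignment or variant files are added.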


Next-generation sequencing education needs 


In 2008, Maryland became the only state in the US to ban the use of its statewide DNA database 
for familial DNA searches. Thirteen years later, Maryland law HB240, “forensic genetic 
genealogical DNA analysis, searching, regulation, and oversight,” authorizes forensic ge- 
netic genealogy analysis. The FGG law went into effect on October 1, 2021, and includes 
a requirement for training and certification for genealogists. On September 24, 2019, the 
US Department of Justice (DOJ) announced an interim policy on FGG DNA analysis 
and searching, effective November 1, 2019. 

Although training and certification were regarded as important enough to include in the law, 
Maryland did not specify what type of training is required or what form the 
certification should take. Numerous FGG courses of 
different lengths are available, including online and in-person options (Table 28.2). 


Table 28.2 Forensic genetic genealogy courses. 

Institution | Course/program title | Website 
Brigham Young University | Certificate in family history research | https://www.byupathway.org/certificate/family-history-research 
Bureau of Justice Assistance, US DOJ | Forensic genetic genealogy virtual training | https://bja.ojp.gov/news/registration-open-forensic-genetic-genealogy-training 
Boston University | Online certificate in genealogical research | https://genealogyonline.bu.edu/ 
Global Forensic and Justice Center at Florida International University | The fundamentals of investigative genetic genealogy | https://shop.gfjc.fiu.edu/product/the-fundamentals-of-forensic-genetic-genealogy/ 
National Criminal Justice Training Center at Fox Valley Technical College | Enhancing investigations through genetic genealogy | https://ncjtc.fvtc.edu/trainings/TRO0011181/enhanced-investigations-genetic-genealogy 
National Institute for Genealogical Studies | Forensic genealogy | https://www.genealogicalstudies.com/eng/courses.asp?courseID=36 
Family Tree University | Genetic genealogy 101 (online) | https://university.familytreemagazine.com/courses/genetic-genealogy-101 
University of New Haven | Forensic genetic genealogy online graduate certificate | https://www.newhaven.edu/lee-college/graduate-programs/certificates/forensic-genetic-genealogy/index.php 
Genealogical Research Institute of Pittsburgh | Practical genetic genealogy (online) | https://www.gripitt.org/courses/practical-genetic-genealogy/ 
University of Strathclyde | Online postgraduate qualification in genealogical studies | https://www.strath.ac.uk/studywithus/centreforlifelonglearning/genealogy/geneticgenealogycoursesresearch/ 
Towson University | Honors seminar advanced topics: Forensic genealogy and historical research | https://catalog.towson.edu/undergraduate/course-descriptions/honr/ 


The University of New Haven’s 12-credit online graduate certificate in forensic genetic 
genealogy, launched in January 2021, is arguably the most in-depth program available at 
the time of this writing. To earn the graduate certificate, students take semester-long 
courses in fundamentals of forensic biological evidence, forensic genetic genealogy, 
genealogical principles and methods, and FGG practicum, in which they complete a 
case research project. Other FGG courses and certificates are offered by Brigham Young 
University, the Bureau of Justice Assistance (BJA), Boston University (BU), the National 
Criminal Justice Training Center (NCJTC) at Fox Valley Technical College (FVTC), the 
Global Forensic and Justice Center (GFJC) at Florida International University (FIU), 
Family Tree University, the Genealogical Research Institute of Pittsburgh (GRIP), the 
National Institute for Genealogical Studies, the University of Strathclyde, and Towson 
University (TU). The courses offered by the BJA, the GFJC, Family Tree University, 
and GRIP are week-long courses (40 hr in the case of the GFJC), the NCJTC course at FVTC 
is a 2-day training course, and the TU course runs for a semester. Most 
of the courses are online, although the NCJTC, TU Honors College, and GFJC courses 
are offered in person. Students and professionals 
can earn college credit for the university courses offered by BYU Global, BU, the 
University of New Haven, the University of Strathclyde, and TU. 

Some specifics need to be clarified. Is credit-bearing education required to meet the 
certification requirements? Should there be a minimum grade point average? Should 
there be a certification exam to attain and maintain the certification to become an 
investigative genetic genealogist? Maryland’s law stipulates that a licensing program for 
laboratories performing SNP or other sequencing-based DNA testing on FGG evidence 
be established by October 1, 2022, and that a licensing program for individuals performing 
genetic genealogy be established by October 1, 2024. 


Conclusion 


As DNA analysis remains well-characterized and NGS technology has proven to be 
sound and acceptable for forensic applications, including upload to CODIS, the applica- 
tions and uses of the products and technology need to be transparent and ethical. There is 
general consensus surrounding the use of IGG to solve murders, but the priority of other 
types of cases needs to be clarified. The data analysis and statistical tools’ technological 
underpinnings and software calculations need to be open, transparent, and published. 
The communication of investigative results obtained from IGG needs to be clear and 
concise for all parties involved. This is not only legally sound but also ethical, so that 
the public maintains trust in law enforcement and forensic labs. 
Introduction 


Next-Generation Sequencing (NGS), otherwise known as Massively Parallel Sequencing 
(MPS), or High Throughput Sequencing (HTS), promises to revolutionize the field of 
forensic science in supporting the activities of the criminal justice system. These tech- 
niques relate to DNA sequencing technologies through which multiple pieces of 
DNA are sequenced in parallel. NGS generates large amounts of data and thus it accel- 
erates the broadening of DNA analysis beyond Short Tandem Repeats (STR) toward a 
wider array of genetic markers, in diverse applications such as forensic DNA phenotyp- 
ing, ancestry assignment, and full mitochondrial analysis (Daniel et al., 2015; de Knijff, 
2019; Hunter, 2018; Scudder et al., 2018b), while also opening new avenues for micro- 
bial forensics (Kuiper, 2016; Oliveira & Amorim, 2018). 

Besides NGS’s higher multiplexing capacity of combining different genetic markers, 
one other assured advantage of these techniques is to allow the search for “discernible 
uniqueness”. They are therefore considered to be a real breakthrough in distinguishing 
identical twins (Amorim & Pinto, 2018). To summarize, NGS generates a vast amount 
of personal genomic data with unprecedented efficiency, speed, sensitivity, and depth of 
information that is expected to improve the fight against crime in the near future (Butler 
& Willis, 2020; de Knijff, 2019; Scudder et al., 2018a). 

NGS amplifies the possibilities of genetic informativity by introducing direct or indi- 
rect analysis of coding regions of DNA, therefore generating very sensitive information, 
including inferences about health conditions. In this context, the boundary between 
“medical” and “forensic” has become, or will soon become, a matter for dispute (Bonomi 
et al., 2020; Lefevre, 2018; Machado & Silva, 2015; Wienroth et al., 2014; Williams 
& Wienroth, 2017). 
NGS has begun to be developed in bioengineering and biomedicine, but it has also 
been attracting the attention of forensic geneticists wanting to develop new methods for 
the identification of suspects of criminal acts (Alvarez-Cubero et al., 2017; Aly & Sabri, 
2015; Ballard et al., 2020; Berglund et al., 2011; Budowle et al., 2017; Butler, 2015; 
Børsting & Morling, 2015; Imam et al., 2018; Kuiper, 2016; Oliveira & Amorim, 2018; 
Parson et al., 2016; Ryan et al., 2020). A recent European-wide online survey targeting 
forensic practitioners, involving 105 replies from participants representing police, governmental, 
academic, and private laboratories providing professional services in the field of 
forensic genetics in 32 European countries, showed that 73% of respondents already own an NGS platform 
or are planning to acquire one within the next 1-2 years (Gross et al., 2021). 

Most of the existing literature about the ethical and social challenges of NGS focuses 
on the clinical medicine and health research domains (Ballard et al., 2020; Clarke, 2014; 
Curtis et al., 2019; Kochański & Demkow, 2015; Martinez-Martin & Magnus, 2019). 
However, it is recognized that more attention must be paid to the Ethical, Legal, and So- 
cial (ELSI) challenges of the uses of NGS in the field of criminal identification (Amorim 
& Pinto, 2018; Curtis et al., 2019; Samuel et al., 2018; Scudder et al., 2019). This chapter 
aims to contribute to filling this gap in ELSI literature by adopting the lens of a sociolog- 
ical understanding of NGS technologies as “promissory entities” which are framed by 
“expectations” toward the future development of forensic genetics (Tutton & Levitt, 
2010). As suggested by Mike Fortun (2006), promissory narratives cannot be reduced 
to either hype, hope, or speculations, but occupy the uncertain, difficult space in be- 
tween. The promise and related statements involving the future may be volatile, but 
they are also necessary to society and science, especially with regard to emergent technos- 
cience in the forensic field (Wienroth, 2018b). This is particularly relevant given that 
“NGS poses massive challenges to science and society” (Bösl & Samida, 2021, p. 15). 

Besides being aware of the potential of NGS technologies for supporting the activities 
of the criminal justice system, it is vital to attend to their ethical and social issues, 
including conflicts with human rights, especially in terms of privacy (Hong et al., 
2015; Lippert et al., 2017; Martinez-Martin & Magnus, 2019; Ong et al., 2011; Schwab 
et al., 2018), and discrimination (Dupras et al., 2020; Duster, 2015; Moreau, 2019; 
Wauters & Van Hoyweghen, 2016). The security of genome-wide data and ethical 
and legal issues of data sharing should also be carefully and publicly debated (Bonomi 
et al., 2020). Such discussions are crucial to elaborate recommendations to support 
NGS’s ethically responsible implementation and use in forensic practice, and within 
the criminal justice system more broadly, since the technology has advanced faster 
than the capacity to introduce safeguards. 

Attempting to contribute to this debate, in the first sections of this chapter, we point 
to important challenges and potential issues related to the uses of NGS in the criminal 
justice system, by addressing the emerging ethical, legal, and social issues that deserve spe- 
cial attention and protection in legislation and forensic practices. In the final section, we 
detail the social and cultural aspects that surround the understanding of NGS 
technologies as “promissory entities” framed by “expectations”. We argue that the 
perceived role of NGS for the future development of forensic genetics is at least partially 
explained by macro-level societal trends toward the “datafication” of society (Cukier & 
Mayer-Schoenberger, 2013; Sadowski, 2019; Van Dijck, 2014), the expansion not only of 
mass surveillance in the digital world (Bellanova, 2017; Lyon, 2014) but also of 
surveillance capitalism (Lyon, 2017, 2019), and a trend toward an emphasis on the 
“actionable genome” (Houtman et al., 2021) that has led to a “participatory turn” in forensic 
genetics (Granja, 2020; Jasanoff, 2003). We conclude this chapter by calling for the 
development of international bioethics guidelines focusing on the uses of NGS and other 
genomic technologies in support of criminal justice activities in future years. 


Gaps in legislation and regulation 


Ethical and legal frameworks around forensic genetics are heavily influenced by the 
“non-coding” nature of the current markers, which allegedly have little connection to 
physical traits (except sex determination) (Curtis et al., 2019) and are perceived as not 
revealing sensitive information about individuals and their biological relatives. Much 
of the existing legislation on the collection and use of DNA in Europe, therefore, emphasizes 
the restriction of forensic DNA to “non-coding” or “uninformative” regions of the 
human genome (Reed & Syndercombe-Court, 2016). However, this paradigm is being 
challenged by new research suggesting that STRs have a regulatory role in gene expres- 
sion (Hunter, 2018). 

In addition, forensics is in the process of a dramatic shift from these “non-coding” 
DNA methods. With the advent of high-throughput NGS, the generation of data at the 
nucleotide level beyond that of STR profiles alone has allowed laboratories to produce 
much more wide-ranging DNA information, for example, single nucleotide polymorphisms 
(SNPs), which can lead to greater discrimination between alleles (Gross et al., 2021; 
Slabbert & Heathfield, 2018). As more of the human genome is sequenced in a forensic 
context, more attributes of the donor are revealed. These emerging applications 
of NGS technologies raise concerns, since it is unclear how 
well existing legal and regulatory mechanisms and encryption algorithms developed 
for genetic information are suited to the protection of large datasets in genome-wide 
analysis in forensics. Publicly available whole-genome datasets may become a resource 
for the reidentification of individuals or their relatives by linking their whole-genome-derived 
STR profiles with common identifiers in DNA STR identification profile databases 
(Hong et al., 2015). This is problematic from a legal and ethical point of view since 
the growing use of NGS technologies might be considered a breach of genetic privacy by 
introducing direct or indirect analysis of coding regions of DNA. Sufficient 
safeguards must therefore be implemented to guarantee, for example, the encryption of genetic 
data in order to protect individuals’ personal information (Skeva et al., 2020). 
The donation of genetic profiles to genealogical databases and their 
subsequent use in criminal investigations raise consent issues, and there are no legal provisions for 
destroying or returning the data when used for criminal investigations (Cho & Sankar, 
2004). This new era of forensic genetics, therefore, exposes the inadequacy of current 
legislation (Curtis et al., 2019; Dupras et al., 2018; Samuel et al., 2018), and a “lack of 
an adequate legislative framework” (Butler & Willis, 2020, p. 354). As Curtis et al. 
(2019, p. 1484) stated, “the law is already failing to catch up to advances in genetic technology.” 
This emphasizes the need to rethink current legislation to guarantee full 
and proper protection for genetic data (Curtis et al., 2019). 


Access to databases that hold full-genome data 


Genome-wide data, either in commercial databases aimed at recreational purposes or in 
databases for research and clinical purposes, continue to grow at a high rate, raising tech- 
nical, ethical, and legal challenges (Borry et al., 2010; Chow-White et al., 2018; 
Kalokairinou et al., 2018; Kennett, 2019; Tutton, 2004). This means that DNA databases 
are becoming larger, in terms of the types of data included, and the number of people 
whose data are stored in them (Samuel et al., 2018). 

The difficulties of the application of genome-wide data in forensics are of a statistical, 
computational, economic, and technical validation nature. Amorim and Pinto (2018, p. 
104) emphasize the following question: “what to do with a substantial quantity of sen- 
sitive information that is not essential for identification purposes, but relevant for infer- 
ring traits that might be medically important or not?” The authors also claim that 
traditional ethical perspectives, as well as specific regulations, are clearly against the infer- 
ence of information beyond identification purposes, and most current forensic DNA da- 
tabases explicitly exclude any markers that can reveal physical traits or which are situated 
at coding regions (Amorim & Pinto, 2018). 

However, while state-regulated DNA databases tend to limit the collection and 
archiving of data to noncoding DNA data, recent events have shown the increasing 
willingness of police agencies to access commercial recreational databases 
to which citizens voluntarily upload genetic information to learn more about their 
health and/or ancestry. Although these commercial tests have been validated through use 
by genealogists, they have not been validated for forensic use. Nevertheless, since 
2018 commercial databases aimed at recreational purposes have been used to identify 
suspects and missing persons by law enforcement agencies (Kennett, 2019). Such databases 
are not circumscribed by restrictive legislation and hold full-genome data 
(Bonomi et al., 2020; Granja, 2020; Kennett, 2019; Samuel et al., 2018; Samuel & 
Kennett, 2020a, 2020b; Schwab et al., 2018). Consequently, it is becoming increasingly 
challenging to guarantee that only noncoding DNA can be used for forensic purposes 
(Samuel et al., 2018). 
The traditional ELSI debate in the field of forensic genetics usually focuses on forensic 
DNA databases created for criminal investigation purposes (Hindmarsh & Prainsack, 
2010), so it does not capture issues that arise when data are used for purposes 
beyond those for which they were originally collected, as when commercial DNA databases 
aimed at recreational purposes are subsequently used in law enforcement investigations. 
Nonetheless, an ethically informed debate is urgently needed, considering that “in 
high-profile criminal cases it will be difficult to resist the political push to make data in 
these databases open to using in criminal investigations” (Samuel et al., 2018, p. e21). 

In addition, there are several issues related to (the lack of adequate) consent from ge- 
netic genealogy database users (Samuel & Kennett, 2020a; Skeva et al., 2020). Specif- 
ically, the individuals may share their data without understanding the potential 
implications for themselves as well as their current and future blood relatives (Bonomi 
et al., 2020). Many people in genetic genealogy databases have not given informed con- 
sent to law enforcement, since most companies do not clearly inform users that their in- 
formation may be subject to forensic analysis (Berkman et al., 2018). This has 
consequences both for those who voluntarily gave data for genealogy purposes which 
are now being used for forensic purposes, and for relatives who may be identified 
(Kennett, 2019). It is therefore necessary to point out the importance of obtaining con- 
sent from database users to allow access by law enforcement officers (Samuel & Kennett, 
2020a). To this end, new legal standards should be designed to guarantee that the indi- 
viduals are adequately informed about the future purposes for which their data can be 
used (Bonomi et al., 2020). How specific the consent should be remains an 
unresolved issue (Berkman et al., 2018). 

Significant public concern has already been reported about access to and use of ge- 
netic genealogy data held in commercial databases (Berkman et al., 2018), but the discus- 
sion has been limited, disjointed, and unfocused (Butler & Willis, 2020; Curtis et al., 
2019; Phillips, 2018). Among the most debated topics are issues involved in conceiving 
publicly accessible data as holding intelligence value for criminal investigation purposes, 
since certain criminal cases may justify access to noncriminal databases (Samuel et al., 
2018). As Berkman et al. (2018, p. 1) stated, “a person might reasonably be surprised 
if his or her genealogic data were used in a criminal investigation, because that use is 
far afield from the original purpose for which the information was given”. Forensic 
genetic genealogy also goes to the heart of the broader issue of surveillance, and the 
extent to which the privacy of citizens can be sacrificed to enforce the law (Curtis 
et al., 2019; Scudder et al., 2020), by expanding “surveillance capitalism” (Lyon, 2019, 
p. 66) where data is increasingly collected and accessed. 

Finally, another topic of concern is the potential blurring of forensic and 
nonforensic databases. As forensic genetics and medical genetics converge toward 
genome sequencing, issues surrounding genetic data become increasingly intertwined 
and cannot be viewed in isolation (Machado & Silva, 2015; Samuel et al., 2018; Williams 
& Wienroth, 2017), although less attention has been given to the use of genetic data for 
nonmedical purposes (Cho & Sankar, 2004). The speed of these developments has surprised 
many and demands a policy response to protect trust in medical genetics and in 
forensic genetics (Curtis et al., 2019), as the same data can be used both for medical 
and forensic purposes (Cho & Sankar, 2004). This is possible since some jurisdictions 
have made provisions allowing the use of health information by law enforcement in circumstances 
deemed to be reasonable (Curtis et al., 2019). This requires a change in the 
current legal standards to guarantee that access to genetic data is transparent and 
overseen by independent bodies (Samuel et al., 2018). 


Data protection 


The recent use of genomic data for forensic purposes has brought privacy and data protection 
concerns to the attention of the general public (Bonomi et al., 2020). Data protection 
is achieved by using technical tools to enforce restrictions on destruction and 
(unauthorized) access and actions. Data privacy concerns the use and governance of personal 
data, and thus the development of and adherence to administrative policies 
(Carter, 2019; Curtis et al., 2019; Kennett, 2019; Martinez-Martin & Magnus, 
2019). 

Because the traditional privacy models designed, for example, for health data 
provide limited protection, the same limitations arise when we apply these models to NGS 
technologies in forensic genetics. The expansion of genealogy databases, which contain 
millions of DNA profiles (Kennett, 2019), and the increasing possibilities of accessing 
public personal information associated with genomic data (for example, family names, 
geographic data, and social media) (Bonomi et al., 2020), emphasize the importance of 
understanding how, in the future, a higher level of protection of individuals can be pro- 
vided in the context of these new scenarios. 

In addition, the recent tendency to adopt cloud services to supply the 
computational capacity required to analyze these large datasets raises new cybersecurity 
issues and threats to privacy and data protection. Since the threats to cybersecurity and 
privacy are increasing, this can pose significant risks to data donors, as their information 
can be accessed through hacking (Carter, 2019). Data protection is urgently 
needed as there are increasing moves to expand informativity, for example, by using 
a combination of genomic data and phenotypic traits gathered from public databases, 
photos, and social media (Kim et al., 2018; Lippert et al., 2017). 

The use of commercial databases aimed at recreational purposes to identify criminals, 
together with the advancing science of trait prediction in forensic investigations (Granja 
et al., 2020; Granja & Machado, 2020; Lippert et al., 2017; Scudder et al., 2018b; 
Wienroth, 2018a, 2018b), and other genome-wide analyses are crucial advances 
that not only mark the start of a new era in forensic genetics (Machado & Granja, 
2020) but also raise additional urgent challenges to the application of NGS technologies. 
As commercial and other human genetic information databases containing genome-wide 
data continue to grow at their current high rate, the possibility of obtaining a sample from 
a given individual and then matching it to a supposedly anonymous sample in a different 
database is no longer theoretical (Carter, 2019). Martinez-Martin and Magnus (2019) 
stated that, with NGS and the growth of databases, it is estimated that within a few years it 
will be possible to identify almost anyone in the U.S.A. from a DNA sample, even if 
they have not voluntarily placed their genetic information in the public domain. Moreover, 
this poses data protection questions regarding whether the database users know that their 
DNA information will be searched by law enforcement in this way (Samuel & Kennett, 
2020a). It is mandatory to respect individuals’ rights to be adequately informed about how 
their data are being used (Skeva et al., 2020). 


Algorithms, privacy, and discrimination 


Whole-genome sequencing datasets contain information on current and future diseases 
and are essential for the future discovery of genetic etiologies of disease (Martinez-Martin 
& Magnus, 2019; Wauters & Van Hoyweghen, 2016). They have also been 
used to assess biological relationships, including ancestry and parentage (Bruck & Frieman, 
2021), and have reopened debates about genetic ancestry (Burmeister, 2021; Kohler, 
2021; Lang & Winkler, 2021). To protect the DNA privacy of individuals 
participating in whole-genome sequencing, there needs to be constant vigilance to iden- 
tify and protect against the possible misuse of information (Bonomi et al., 2020; Hong 
et al., 2015). In this regard, the issue of maintaining the confidentiality of data is partic- 
ularly important. As outlined by Schwab et al. (2018), with NGS, profiled individuals 
may be reidentified through raw genetic data, even when social identifiers (name, date 
of birth, etc.) are removed and/or by identification through family members (see also 
Hong et al., 2015). This implies that if an individual decides to undergo genetic testing, the 
genetic privacy of the whole family (both current and future members) is implicated 
in that decision (Moreau, 2019). 

NGS technologies generate large datasets that require software and algorithms to pro- 
duce forensic analysis (Amorim & Pinto, 2018; Martinez-Martin & Magnus, 2019), 
which raises the potential for bias in the data used to construct the algorithms and bias 
in the algorithms themselves. If the dataset does not accurately reflect the population 
to which it is applied, then bias will be transferred to the outcomes generated by the al- 
gorithm. As pointed out by Martinez-Martin and Magnus (2019), the problem of bias 
presents a particular concern in the applications of NGS because of the historical lack 
of racial and ethnic diversity in human genetic and genomic research. This bias, together 
with police prejudice toward particular ethnic and racial minorities, raises the potential 
for differential treatment or abusive profiling of individuals or groups based on their 
actual or presumed genetic characteristics (Crutchfield et al., 2012; Hunter, 2018; 
M’charek et al., 2013). 
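
How a non-representative reference dataset carries over into an algorithm's output can be shown with a small sketch. It assumes a textbook calculation that the chapter does not itself describe: genotype frequencies under Hardy-Weinberg proportions, multiplied across independent loci to give a random match probability. The locus names and allele frequencies are hypothetical, and the function names are chosen only for this illustration.

```python
from math import prod


def genotype_frequency(allele_1, allele_2, freqs):
    """Expected genotype frequency under Hardy-Weinberg proportions:
    p*p for a homozygote, 2*p*q for a heterozygote."""
    p, q = freqs[allele_1], freqs[allele_2]
    return p * p if allele_1 == allele_2 else 2 * p * q


def random_match_probability(profile, reference):
    """Multiply genotype frequencies across loci assumed independent."""
    return prod(
        genotype_frequency(a1, a2, reference[locus])
        for locus, (a1, a2) in profile.items()
    )


# Hypothetical allele frequencies for the same two SNPs in two
# reference datasets drawn from different populations.
REFERENCE_A = {"snp1": {"A": 0.5, "G": 0.5}, "snp2": {"C": 0.9, "T": 0.1}}
REFERENCE_B = {"snp1": {"A": 0.1, "G": 0.9}, "snp2": {"C": 0.5, "T": 0.5}}

PROFILE = {"snp1": ("A", "A"), "snp2": ("C", "T")}

rmp_a = random_match_probability(PROFILE, REFERENCE_A)
rmp_b = random_match_probability(PROFILE, REFERENCE_B)
```

With these made-up numbers the same profile appears about nine times rarer when evaluated against the second reference than against the first, so the choice of reference dataset alone changes the apparent strength of a match.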

Forensic DNA databases usually overrepresent certain groups and social categories, such 
as racial and ethnic minorities (Duster, 2003; Skinner, 2013), since these groups are the 
preferential target of the actions of the criminal justice system. Although nonforensic 
DNA databases, such as medical, commercial, and research databases, usually present the 
opposite trend, lacking diversity in their composition, the increasing genetic informativity 
brought by NGS might nevertheless lead to new forms of discrimination. In this regard, 
two examples stand out. One regards the potential offered by NGS to augment the infer- 
ence of phenotypic traits. By improving the predictions of forensic DNA phenotyping, 
and thereby increasing the visibility of racial or ethnic differences, NGS might contribute 
to the reproduction of stigmatization and reinforce the criminalization of certain groups 
(Lippert et al., 2017; Scudder et al., 2018a). Another example shows the contribution 
of NGS to forensic genealogy. By expanding genetic informativity, NGS might contribute 
to placing certain families under the lens of criminal investigation (Hong et al., 2015). 

As outlined by Wauters and Van Hoyweghen (2016, p. 276), genetic discrimination 
refers to “the differential treatment of asymptomatic individuals or their relatives on the 
basis of their real or assumed genetic characteristics” which, up until recently, has been 
discussed mainly in terms of genetic diseases, usually referring to practices adopted by in- 
surance companies and/or employers (Dupras et al., 2018, 2020). However, more 
recently researchers have been outlining how “epigenetic discrimination” might also 
significantly affect criminal investigations. As the interest in epigenetics applied to the 
field of criminal investigation rises (Haddrill, 2021; Hasnain, 2019; Vidaki et al., 
2013), new avenues for differential treatment and abusive profiling also expand (Dupras 
et al., 2018). For instance, associations between epigenetic variants and deviant and crim- 
inal behaviors can maximize discrimination toward certain groups (Dupras et al., 2020). 

The use of algorithms and Big Data techniques¹ for NGS analysis might also lead to
“black box” problems, where results or findings that arise from an automated tool are
perceived as inherently more objective or accurate (Christin, 2020; Lefevre, 2018;
Martinez-Martin & Magnus, 2019). The study of algorithmic “black boxes” reveals
that the discourse focused on technological magic might serve as a smokescreen
that obscures the role of algorithms in reproducing these social processes of discrimination
and surveillance (Brayne, 2017; Christin, 2020). As previously discussed in other studies
on the application of Big Data and algorithms in criminal investigation (see e.g., Browning
& Arrigo, 2021; Minocher & Randall, 2020; Sanders & Sheptycki, 2017), these techniques are
promoted under a veil of objectivity and veracity, building social expectations around
their ability to produce unquestionable results (Boyd & Crawford, 2012). These widespread
beliefs in Big Data techniques can mobilize arguments around NGS analysis as
factual and capable of generating intelligence in a way that was not previously possible.
Consequently, this “faith in the technology” (Lyon, 2014, p. 6) to sequence DNA automatically
and rapidly to identify criminal suspects may expand social imaginaries associated
with forensic genetics (Lynch et al., 2008) and shape expectations around NGS as
capable of producing irrefutable truths. These arguments are fallacious and have ethical,
legal, social, and cultural consequences. Firstly, they assume that these processes are
characterized by transparency and participatory engagement, guiding neutral and objective
decisions (Smith et al., 2017). Secondly, they foster societal trust in the
“epistemic capabilities of algorithms” (Aradau & Blanke, 2015). The analysis of these
“black box” problems clarifies that all automated processes have errors and may reproduce
erroneous conclusions (Martinez-Martin & Magnus, 2019). Therefore, it is necessary
to begin “demythologizing” (Amorim, 2012, p. 259) algorithms and Big Data for
NGS analysis as automated techniques capable of producing objective, neutral, and
irrefutable results to fight crime, since these social expectations may influence the new
directions that NGS application in forensic criminal identification is assuming.

¹ Generally characterized by automatically aggregating, processing, and analyzing massive
datasets to produce algorithms that enhance the effectiveness and efficiency of
decision-making (Christin, 2016; Kitchin, 2014; Stevens et al., 2018).

Another problematic issue is the lack of transparency of industry developers who are 
generally reluctant to share information about the software they are designing and selling 
that would help forensic experts to evaluate the specific reasoning behind the outcomes 
generated (Amorim & Pinto, 2018; Granja & Machado, 2020; Hunter, 2018; Wienroth, 
2018a). Increased transparency of these techniques and their creation processes is needed
to open these “black boxes” of algorithms and illuminate how these mechanisms arise and 
create criminal suspicion (Leese, 2014). Considering that the producers of the technologies
rarely answer questions about how they are created (Leese, 2014; Martinez-Martin &
Magnus, 2019), and the error rates of NGS platforms are higher when compared to other 
techniques (Ballard et al., 2020), the involvement of forensic experts and scientists in prob- 
lematizing NGS analyses in their early stages is urgent. It could create opportunities to 
open the “black boxes” of techniques and algorithms, and reduce their errors and bias. 


Societal trends framing NGS technologies 


Current and potential future uses of NGS technologies cannot be decontextualized from 
the historical, social, and cultural environment. Molecular technologies and genetic da- 
tabases in the forensic context, including criminal identification, are often framed in 
terms of their societal benefits. They can help solve crimes by identifying suspects, but 
the uses of these technologies are also entangled with histories of coercion, racial surveil- 
lance, and the risk of stigmatization (Duster, 2003, 2006; Hindmarsh & Prainsack, 2010; 
Ossorio & Duster, 2005; Skinner, 2013). 

Previous studies reveal that experts and scientists are aware of these ethical challenges 
and pose questions about how to use NGS techniques in an ethically sound way. These 
ethical concerns, and the awareness of the ethical issues associated with NGS technologies,
reveal two crucial aspects of the professional culture of the forensic geneticists’ community.
On the one hand, an adherence to the principles of anticipatory governance
(Wienroth, 2018b) in which scientists seek to agree on appropriate practices in contexts 
tainted by controversy and uncertainty, ensuring the legitimacy of scientific knowledge 
and practice while invoking other values such as notions of accountability and transpar- 
ency. On the other hand, invoking ethical challenges of NGS technologies allows an 
active practice of demarcation in relation to other agents active in the field of NGS 
techniques—clarifying the scientists’ specific practices and rationales, while distancing 
themselves from private and commercial companies as well as from law enforcement 
agencies (Granja & Machado, 2020). 

The exposure of the ethical challenges of the uses of NGS technologies is embedded 
in the sociality of science and in the way scientific work is legitimated. The awareness of 
privacy issues, as well as of discriminatory practices potentially generated by the flawed 
uses of NGS techniques employed by forensic practitioners, reveals the management 
of controversies, both of which are seen as ways to lend legitimacy and objectivity to sci- 
entific work (Granja & Machado, 2020; Machado & Granja, 2018). This sociological 
approach to published literature on the main ethical, legal, and social issues related to 
NGS in the forensic field allows us to understand how ethics plays a role in the ways 
in which scientists relate to each other and negotiate responsibilities both within and 
outside their fields of expertise (Knorr-Cetina & Mulkay, 1983; Latour, 1987). 

NGS technologies in the forensic field are mostly perceived as “promissory entities” 
since despite being in an exploratory stage and barely used in routine casework, most 
forensic geneticists share the expectation that these technologies will be fully
established in the forensic field in the next decade, marking a “revolution”.
This promissory character, which underscores the revolutionizing aspect of NGS, relates to
the expectation that these techniques will allow the generation of a vast amount of per- 
sonal genomic data with unprecedented efficiency, speed, sensitivity, and depth of infor- 
mation. This underpins the eagerness of health systems, corporations, and forensic 
initiatives to follow the productivity, efficiency, and power trends that characterize
“surveillance capitalism” (Lyon, 2019, p. 66). Considering this macro context, NGS 
should be put into perspective as a materialization of this “surveillance capitalism” that 
marks the 21st century and is characterized as a collective desire to explore many and var- 
ied data, where people are attracted to the idea of ceding their genetic profile, and their 
information is increasingly accumulated. This makes evident how the arguments around 
NGS are embedded in “surveillance capitalism”, as a general trend to achieve productive
aims in neoliberal societies, where various participants are drawn toward their develop- 
ment, given the seductive nature of their features. As we have shown, this comes with a 
wide range of implications, not only in terms of how routine work is approached, but 
also in ethical, legal, and social considerations. 
The conception of NGS as a “promissory entity” must be considered as it holds the
power to guide the future of the field of forensic genetics in supporting criminal justice 
activities. As stated by Mike Fortun (2006, p. 156) “The rhetoric of the promise is every- 
where in genomics, and it’s all too easy and all too tempting to dismiss or overlook the 
real paradoxes of promising, and either take such statements at face value or dismiss them 
as ‘mere hype.’” As several studies in the sociology of expectations have shown, expec- 
tations hold “generative” and “performative” power, as they guide activities, provide 
structure and legitimation, attract interest and foster investment (Borup et al., 2006; 
Brown & Michael, 2003; Konrad, 2006). By positing a desired future for these “promissory
entities”, and thereby mobilizing resources at the macro, meso, and micro levels,
expectations are able to make such futures real (Borup et al., 2006). Since these social
expectations may influence the new directions that NGS application in forensic criminal 
identification is assuming, we point out the importance of “demythologizing” (Amorim, 
2012, p. 259) this technique as capable of producing objective, neutral, and irrefutable 
results to fight crime by opening its “black boxes.” Specifically, we draw attention to 
the risks of assuming that NGS results are objective because they are produced
by automated techniques such as Big Data. As extensive literature (Browning & Arrigo,
2021; Minocher & Randall, 2020; Sanders & Sheptycki, 2017) has already shown, data
contain errors, and it is necessary to increase the transparency of these techniques (from
their creators to their mode of use) in order to deconstruct these “black boxes” and to 
reduce their ethical and social consequences. 

As far as current usage is concerned, NGS technologies tend to be conceived as 
useful for generating intelligence, and not evidence. This is framed by a broader trend 
in forensic science, already inaugurated by other technologies, such as forensic DNA 
phenotyping and familial searches, which have been shifting the focus of forensic sci- 
ence from the construction of evidence to the generation of intelligence considered 
valuable to criminal investigations (Machado & Granja, 2020; Wienroth, 2018b). In 
addition, the expectations about the use of NGS are also dependent upon professionals’ 
access to resources and the type of work being conducted, the focus of which could be 
more centered either on research science or on forensic applied science. Professionals 
involved in research and who have access to a wide range of resources, owing to 
research funding and/or collaboration with equipment providers, tend to be more optimistic
about NGS’s current and future applications. By contrast, forensic geneticists
who are more involved with routine casework tend to describe these techniques
as a time-consuming and expensive process with arguable added benefits for routine
casework. Expectations about NGS’s promising nature and the expected new possibil- 
ities associated with it can, therefore, be understood in relation to different locations of 
forensic geneticists within a developing network of innovation relationships (Machado 
& Granja, 2021). 
Finally, it is important to consider the notion of responsibility. Nowadays, there is the
prevalence of what some authors call the “actionable genome” (Houtman et al., 2021)
that is leading to a “participatory turn” in forensic genetics (Granja, 2020). The widespread
use of NGS technologies, at an ever lower cost for laboratories (clinical and
medical or operating in the field of forensic services), makes the expansion of possibilities 
and choice enactment regarding the use of the human genome more evident. Houtman 
et al. (2021) argue that in present times “The genome is becoming actionable, meaning 
we can make decisions based on, or directly affecting, our personal genomic information. 
This is a result of recent developments in sequencing, interpreting, and editing the hu- 
man genome” (Houtman et al., 2021, p. 1). From the perspective of the authors, this 
new capacity to make decisions about our own genome in several domains—ranging 
from recreational purposes to biobanking-related aims—must imply a strong sense of in- 
dividual responsibility since decisions can also impact others and have wide-ranging im- 
plications in the future. Conversely, the authors also argue that genomic professionals 
equally hold the responsibility of enabling the process of opinion formation and informed 
deliberation. Considering the issue of individual responsibility, social responsibility, and 
the experts’ accountability in genomics, it can be argued that the deliberation about the 
“actionable genome” is often not facilitated because of the “diffusion of responsibility” 
and also because guidance in the decision-making process is often lacking. While uses 
of NGS technologies can be put into practice by different stakeholders, such as medical 
organizations, research consortia, commercial companies, and a variety of biobanks, not 
only are the boundaries between clinical, research, public health, and commercial appli- 
cations blurring, but also not all stakeholders are acting in the public’s best interest (Hout- 
man et al., 2021). 

The “actionable genome” therefore implies that individuals might, indeed, make de- 
cisions about the uses of their genomic data that will affect not only them, but also their 
families, communities, and society at large. Nowadays, such decisions go beyond their 
option to know more about their health or ancestry, which were the aims underlying
the growth of the direct-to-consumer genetic testing market (Borry et al., 2010; Chow- 
White et al., 2018; Tutton, 2004). More recently, the “actionable genome” also entails 
the decision of whether to make their genetic data available to law enforcement searches. 
According to Granja (2020), this marks a “participatory turn” in forensic genetics. 
Contrary to what previously happened, when individuals involved in law enforcement 
searches had some type of involvement with the criminal justice system, nowadays indi- 
vidual citizens interested in personal genomics and who have already purchased a direct- 
to-consumer genetic test have the possibility of choosing to make their data available for 
law enforcement actions. 

This participatory turn is, nonetheless, still anchored in longstanding structures of po- 
wer and inequality. Suspects or those who have been convicted within the criminal jus- 
tice system, often from underprivileged and racialized communities, have their profiles 
included in forensic DNA databases on a mandatory basis (Hindmarsh & Prainsack, 2010), while
consumers of recreational DNA databases, generally from economically privileged 
groups, have the option of choosing to make their profiles available to law enforcement 
altruistically (Granja, 2020). 


Conclusion 


We conclude that understanding the ethical, legal, and social challenges of NGS, together with
ongoing monitoring of its applications in the criminal justice system, is required.
Practices related to data collection, storage, and use must be acceptable to gain the trust of 
the public. We argue that the perceived role of NGS for the future development of 
forensic genetics is at least partially explained by macrolevel societal trends toward “data- 
fication” of society (Cukier & Mayer-Schoenberger, 2013; Sadowski, 2019; Van Dijck, 
2014), expansion not only of massive surveillance in the digital world (Bellanova, 2017; 
Lyon, 2014), but also of surveillance capitalism (Lyon, 2017, 2019), and a trend toward an
emphasis on the “actionable genome” (Houtman et al., 2021) that has led to a “participatory
turn” in forensic genetics (Granja, 2020; Jasanoff, 2003).

The analysis of ethical, legal, and social implications should be integrated into genetic 
research, with the participation of scientists who can anticipate and monitor the full range 
of possible applications of the research from the earliest stages. Problematizing these tech- 
niques must be complemented by understanding how they can be used in ways that 
minimize harmful societal impacts, such as discrimination, stigmatization, and weakening 
data protection and genetic privacy. These discussions must reach out to forensic experts 
and other professionals in the criminal justice system in order to broaden the understand- 
ing of the social implications of these recent advances in the field of forensic genetics. The 
design and implementation of research direct the ways in which results can be used, and 
data and technology, rather than ethical considerations or social needs, tend to drive the 
use of science in unintended ways. Here we argue that geneticists should anticipate the 
ethical, legal, and social issues associated with nonmedical applications of NGS technol- 
ogies. It will be necessary to develop comprehensive and standardized regulations to 
address these challenges. 

Considering that the producers of these technologies are rarely involved in debates 
about their routine use, their problematization of NGS techniques could help to open 
the “black boxes” of these techniques to foster the understanding of their challenges 
more deeply. Based on their knowledge, it might be possible to understand how auto- 
mation is working in NGS analyses to reduce the ethical, social, and cultural impact. 
It would also be necessary to undertake empirical analyses of the use of these techniques, 
to explore further: (1) how they could potentially be used and applied; (2) what the
proper rules about information access, storage, and security are; and (3) what value they will
have in identifying criminal suspects. Finally, it will be essential to create full awareness of
how NGS technologies and their uses differ from previous DNA technologies in order to
expose their potential to erode rights and expand surveillance capitalism. The current 
accelerated social trends around the implementation of new technologies such as NGS 
can be slowed down if, as a result of ethical, legal, and social debates, there is a call for 
reflection on questions such as: how necessary is the technology? How will it improve 
the fight against crime? 

The consideration of the ethical, legal, and social issues of genetics research and the
development of technology is very frequently perceived as slowing down science. However,
time spent making explicit the ever-present ethical, legal, and social issues and
incorporating them into study design is better conceptualized as an integral part of the 
research process than as “extra” time, while also preventing serious obstacles that can 
also hamper or even halt research: absence of consideration of ethical issues can breed 
a deep distrust of scientists that can only hurt future efforts to carry out or raise funding 
for future research (Cho & Sankar, 2004). Finally, we argue for the urgent need to
set international bioethics guidelines specifically oriented to the uses of NGS and other
genomic technologies in the future.